This guide explains how to extend the llm-d-benchmark framework with new steps, analysis modules, harnesses, scenarios, and experiments. All code references are accurate to the current codebase.
- 1. Architecture Overview
- 2. How to Add a New Step
- 3. Step Execution Flow
- 4. ExecutionContext Deep Dive
- 5. How to Add a New Analysis Module
- 6. How to Add a New Harness
- 7. How to Add a New Scenario (Well-Lit Path)
- 8. How to Add a Smoketest Validator
- 9. How to Add a New Experiment
The framework is built around a Step abstraction (defined in
llmdbenchmark/executor/step.py). Every unit of work -- deploying a namespace,
running a smoketest, collecting results -- is a Step subclass with a number,
name, phase, and an execute() method.
A benchmark run proceeds through four phases, each with its own ordered list of steps:
- Standup (Phase.STANDUP) -- Provisions infrastructure, deploys models. Steps 00-09 in llmdbenchmark/standup/steps/. Note: smoketest steps (formerly steps 10-11) have been moved to llmdbenchmark/smoketests/ and run as a separate phase after standup.
- Smoketest (Phase.SMOKETEST) -- Post-deployment validation: health checks, inference tests, per-scenario config validation. Steps 00-02 in llmdbenchmark/smoketests/steps/.
- Run (Phase.RUN) -- Detects endpoints, renders workload profiles, deploys harness pods, waits for completion, collects and analyzes results. Steps 00-11 in llmdbenchmark/run/steps/.
- Teardown (Phase.TEARDOWN) -- Uninstalls Helm releases, cleans harness resources, deletes cluster-scoped objects. Steps 00-04 in llmdbenchmark/teardown/steps/.
The experiment subcommand (implemented in _execute_experiment in
llmdbenchmark/cli.py) drives a Design of Experiments (DoE) matrix. For each
setup treatment it:
- Renders plans with config overrides from the experiment YAML
- Runs the full standup phase
- Runs all run treatments (each is a separate run phase invocation)
- Tears down the stack
Results are collected into experiment-summary.yaml.
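In outline, the loop behaves like the sketch below. This is illustrative only: run_experiment, the phases mapping, and the treatment dictionaries are placeholders for exposition, not the actual _execute_experiment API.

from typing import Any, Callable

def run_experiment(
    setup_treatments: list[dict[str, Any]],
    run_treatments: list[dict[str, Any]],
    phases: dict[str, Callable[..., Any]],
    skip_teardown: bool = False,
) -> list[dict[str, Any]]:
    """Sketch of the DoE loop: stand up once per setup treatment, run every run treatment, tear down."""
    outcomes = []
    for setup in setup_treatments:
        phases["render"](setup)          # re-render plans with this treatment's config overrides
        phases["standup"]()
        for run in run_treatments:
            result = phases["run"](run)  # each run treatment is a separate run phase invocation
            outcomes.append({"setup": setup.get("name"), "run": run.get("name"), "result": result})
        if not skip_teardown:
            phases["teardown"]()         # clean up before the next setup treatment
    return outcomes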
StepExecutor (in llmdbenchmark/executor/step_executor.py) sorts all steps
by number, then partitions them into three groups:
- Pre-global: Global steps (per_stack=False) whose number is lower than the lowest per-stack step. These run sequentially before any per-stack work.
- Per-stack: Steps with per_stack=True. These run sequentially within each stack but in parallel across stacks (up to max_parallel_stacks, default 4) using ThreadPoolExecutor.
- Post-global: Global steps whose number is higher than the lowest per-stack step. These run sequentially after all per-stack work completes.
pre-global steps (sequential)
|
v
per-stack steps (parallel across stacks, sequential within each stack)
|
v
post-global steps (sequential)
ExecutionContext (in llmdbenchmark/executor/context.py) is a @dataclass
that carries all shared state through every step and phase. Steps receive it as
the first argument to execute(). Key contents:
- Paths: plan_dir, workspace, base_dir, rendered_stacks
- Execution flags: dry_run, verbose, non_admin, current_phase
- Cluster info: cluster_url, kubeconfig, is_openshift, is_kind, cluster_name, context_name
- Namespace info: namespace, harness_namespace
- Deployed state: deployed_endpoints, deployed_methods, deployed_pod_names, experiment_ids
- Run configuration: harness_name, harness_profile, harness_output, harness_parallelism
- Shared logger and command executor: logger (implements LoggerProtocol), cmd (CommandExecutor)
Place the new step module in the appropriate phase directory. Follow the naming convention
step_NN_descriptive_name.py, where NN is a two-digit step number:
llmdbenchmark/standup/steps/step_11_my_custom_step.py
llmdbenchmark/run/steps/step_12_my_custom_step.py
llmdbenchmark/teardown/steps/step_05_my_custom_step.py
Every step must:
- Call super().__init__() with number, name, description, phase, and per_stack
- Implement execute(self, context, stack_path=None) -> StepResult
- Optionally override should_skip(self, context) -> bool
Here is the base class signature:
class Step(ABC):
def __init__(
self,
number: int,
name: str,
description: str,
phase: Phase,
per_stack: bool = False,
):
...
@abstractmethod
def execute(self, context: ExecutionContext, stack_path: Path | None = None) -> StepResult:
...
def should_skip(self, context: ExecutionContext) -> bool:
return False
The method must return a StepResult. Collect errors into a list, then return
success or failure:
def execute(self, context: ExecutionContext, stack_path: Path | None = None) -> StepResult:
errors: list[str] = []
cmd = context.require_cmd()
# Do work...
result = cmd.kube("get", "pods", "--namespace", context.require_namespace(), check=False)
if not result.success:
errors.append(f"Failed to list pods: {result.stderr}")
if errors:
return StepResult(
step_number=self.number,
step_name=self.name,
success=False,
message="Step failed",
errors=errors,
)
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message="Step completed successfully",
)
Return True to skip the step based on context. For example, the run preflight
step skips when only collecting results:
def should_skip(self, context: ExecutionContext) -> bool:
return context.harness_skip_run

| per_stack | Behavior | When to use |
|---|---|---|
| False (default) | Runs once, globally. stack_path is None. | Cluster-wide operations: namespace creation, Helm uninstalls, preflight checks |
| True | Runs once per rendered stack. stack_path is the path to that stack's rendered config directory. | Operations that differ per deployment topology: smoketests, per-stack deploys |
When per_stack=True, your step must handle a non-None stack_path and can
load per-stack config via self._load_stack_config(stack_path).
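For illustration, a per-stack step's execute() might look like the fragment below. The standalone.replicas key mirrors the scenario example later in this guide; treat the specific keys as assumptions about your own scenario's structure.

    def execute(self, context: ExecutionContext, stack_path: Path | None = None) -> StepResult:
        # per_stack=True, so the executor passes the rendered stack directory
        stack_config = self._load_stack_config(stack_path)
        # Illustrative key; use whatever your scenario actually defines
        replicas = stack_config.get("standalone", {}).get("replicas", 1)
        context.logger.log_info(f"Stack {stack_path.name} requests {replicas} replica(s)")
        return StepResult(
            step_number=self.number,
            step_name=self.name,
            success=True,
            message=f"Inspected stack {stack_path.name}",
        )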
Add the import and instantiation to the phase's __init__.py:
# In llmdbenchmark/standup/steps/__init__.py
from llmdbenchmark.standup.steps.step_11_my_custom_step import MyCustomStep
def get_standup_steps() -> list[Step]:
return [
EnsureInfraStep(),
# ... existing steps ...
SmoketestStep(),
MyCustomStep(), # Add here in order
]
Steps execute in number order. Rules:
- Pick a number that places your step in the correct logical position.
- Global steps with numbers lower than the first per-stack step run before parallel execution. Global steps with higher numbers run after.
- Leave gaps between step numbers when possible (e.g., 10, 20, 30) to allow future insertions without renumbering.
- Within a phase, numbers must be unique.
Note: standup skips step 01 (reserved). Choose numbers that don't conflict with existing steps.
"""Step 11 -- Deploy a custom CRD before model serving starts."""
from pathlib import Path
from llmdbenchmark.executor.step import Step, StepResult, Phase
from llmdbenchmark.executor.context import ExecutionContext
class DeployCustomCrdStep(Step):
"""Deploy a custom Kubernetes CRD required by the workload."""
def __init__(self):
super().__init__(
number=11,
name="deploy_custom_crd",
description="Deploy custom CRD for workload integration",
phase=Phase.STANDUP,
per_stack=False,
)
def should_skip(self, context: ExecutionContext) -> bool:
# Skip if running in non-admin mode (CRDs require cluster-admin)
return context.non_admin
def execute(
self, context: ExecutionContext, stack_path: Path | None = None
) -> StepResult:
errors: list[str] = []
cmd = context.require_cmd()
crd_yaml = context.base_dir / "config" / "crds" / "my-crd.yaml"
if not crd_yaml.exists():
return StepResult(
step_number=self.number,
step_name=self.name,
success=False,
message="CRD file not found",
errors=[f"Expected CRD at {crd_yaml}"],
)
if context.dry_run:
context.logger.log_info(f"[DRY RUN] Would apply {crd_yaml}")
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message="[DRY RUN] CRD apply logged",
)
result = cmd.kube("apply", "-f", str(crd_yaml), check=False)
if not result.success:
errors.append(f"CRD apply failed: {result.stderr}")
if errors:
return StepResult(
step_number=self.number,
step_name=self.name,
success=False,
message="CRD deployment failed",
errors=errors,
)
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message="Custom CRD deployed",
)
Register in llmdbenchmark/standup/steps/__init__.py:
from llmdbenchmark.standup.steps.step_11_deploy_custom_crd import DeployCustomCrdStep
def get_standup_steps() -> list[Step]:
return [
# ... existing steps 00-09 ...
DeployCustomCrdStep(),
]"""Step 03b -- Send warmup requests to the model before benchmarking."""
from pathlib import Path
from llmdbenchmark.executor.step import Step, StepResult, Phase
from llmdbenchmark.executor.context import ExecutionContext
class WarmupModelStep(Step):
"""Send warmup requests to prime KV caches before benchmark."""
def __init__(self):
super().__init__(
number=4, # After verify_model (03), before render_profiles (04)
name="warmup_model",
description="Send warmup requests to prime model caches",
phase=Phase.RUN,
per_stack=False,
)
def should_skip(self, context: ExecutionContext) -> bool:
return context.dry_run or context.harness_skip_run
def execute(
self, context: ExecutionContext, stack_path: Path | None = None
) -> StepResult:
cmd = context.require_cmd()
namespace = context.require_namespace()
# Use the first deployed endpoint
if not context.deployed_endpoints:
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message="No endpoints deployed, skipping warmup",
)
endpoint = list(context.deployed_endpoints.values())[0]
context.logger.log_info(f"Sending 5 warmup requests to {endpoint}...")
# Run a curl pod to send warmup requests
result = cmd.kube(
"run", "warmup-probe", "--rm", "--attach", "--quiet",
"--restart=Never", "--namespace", namespace,
"--image=curlimages/curl",
"--command", "--", "sh", "-c",
f"for i in $(seq 1 5); do "
f"curl -s -X POST {endpoint}/v1/completions "
f"-H 'Content-Type: application/json' "
f"-d '{{\"model\":\"{context.model_name}\",\"prompt\":\"warmup\",\"max_tokens\":1}}'; "
f"done",
check=False,
)
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message="Warmup requests sent",
)
Note: this example is illustrative. The code uses number=4 but the comment
says "Step 03b". In practice, inserting at number 4 would conflict with the
existing RenderProfilesStep (number=4), and number 3 is already taken by the
verify_model step. You would need to renumber existing steps or adopt a
numbering scheme that leaves gaps for future insertions.
"""Step 05 -- Archive benchmark results to S3 after teardown."""
from pathlib import Path
from llmdbenchmark.executor.step import Step, StepResult, Phase
from llmdbenchmark.executor.context import ExecutionContext
class ArchiveResultsStep(Step):
"""Upload workspace results to S3 for long-term storage."""
def __init__(self):
super().__init__(
number=5,
name="archive_results",
description="Archive results to S3",
phase=Phase.TEARDOWN,
per_stack=False,
)
def should_skip(self, context: ExecutionContext) -> bool:
return context.dry_run
def execute(
self, context: ExecutionContext, stack_path: Path | None = None
) -> StepResult:
cmd = context.require_cmd()
results_dir = context.workspace / "results"
if not results_dir.exists():
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message="No results directory to archive",
)
s3_dest = f"s3://my-benchmark-bucket/{context.cluster_name}/{context.namespace}/"
context.logger.log_info(f"Archiving {results_dir} to {s3_dest}")
result = cmd.execute(
f"aws s3 sync {results_dir} {s3_dest}",
check=False,
)
if not result.success:
return StepResult(
step_number=self.number,
step_name=self.name,
success=False,
message="S3 upload failed",
errors=[result.stderr],
)
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message=f"Results archived to {s3_dest}",
)
Each phase has a get_*_steps() function in its __init__.py that returns a
list of step instances. StepExecutor.__init__ sorts them by step.number:
self.steps = sorted(steps, key=lambda s: s.number)
StepExecutor._partition_steps() splits the sorted steps:
- Separate steps into global_steps (per_stack=False) and per_stack_steps (per_stack=True).
- Find the lowest per-stack step number (min_per_stack).
- Global steps with number < min_per_stack become pre-global.
- Global steps with number >= min_per_stack become post-global.
For per-stack steps with multiple rendered stacks, _execute_stacks_parallel()
uses ThreadPoolExecutor with max_workers = min(max_parallel_stacks, len(stacks)).
Each stack runs its per-stack steps sequentially. Stacks execute in parallel.
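Put together, the ordering behaves roughly like this simplified sketch. It is not the actual StepExecutor code; run_steps and its arguments are illustrative.

from concurrent.futures import ThreadPoolExecutor

def run_steps(steps, stacks, context, max_parallel_stacks: int = 4):
    """Simplified sketch of the ordering: pre-global, per-stack (parallel), post-global."""
    steps = sorted(steps, key=lambda s: s.number)
    per_stack = [s for s in steps if s.per_stack]
    global_steps = [s for s in steps if not s.per_stack]
    pivot = min((s.number for s in per_stack), default=None)

    pre = [s for s in global_steps if pivot is None or s.number < pivot]
    post = [s for s in global_steps if pivot is not None and s.number >= pivot]

    for step in pre:                      # sequential, before any per-stack work
        step.execute(context)

    def run_stack(stack_path):
        for step in per_stack:            # sequential within a single stack
            result = step.execute(context, stack_path)
            if not result.success:
                break                     # skip this stack's remaining steps; other stacks continue

    if per_stack and stacks:
        workers = min(max_parallel_stacks, len(stacks))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(run_stack, stacks))   # stacks run in parallel

    for step in post:                     # sequential, after all per-stack work completes
        step.execute(context)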
- Global steps: If a global step fails, execution aborts immediately (_execute_global_steps returns True). No further steps run.
- Per-stack steps: If a per-stack step fails within a stack, that stack's remaining steps are skipped (break in the step loop), but other stacks continue.
- Exceptions: _safe_execute_step() catches all exceptions and wraps them in a failed StepResult, preventing one step from crashing the entire pipeline (see the sketch after this list).
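The exception-wrapping behavior amounts to the following sketch; it is illustrative, not the framework's actual _safe_execute_step implementation.

from llmdbenchmark.executor.step import StepResult

def safe_execute_step(step, context, stack_path=None) -> StepResult:
    """Any exception becomes a failed StepResult instead of crashing the pipeline (sketch)."""
    try:
        return step.execute(context, stack_path)
    except Exception as exc:  # intentionally broad: every failure is captured
        return StepResult(
            step_number=step.number,
            step_name=step.name,
            success=False,
            message=f"Step raised an exception: {exc}",
            errors=[str(exc)],
        )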
Before executing any step, the executor calls step.should_skip(context). If it
returns True, the step is logged as skipped and a successful StepResult with
message="Skipped" is recorded. The step's execute() method is never called.
Steps must check context.dry_run themselves inside execute(). The executor
does not skip steps in dry-run mode automatically. The convention is to log what
would happen and return a successful result:
if context.dry_run:
return StepResult(
step_number=self.number,
step_name=self.name,
success=True,
message="[DRY RUN] Would have done X",
)
Users can run specific steps with --steps "0,3-5,9". StepExecutor.execute()
passes the spec to parse_step_list() which returns a set of allowed step
numbers. Steps not in the set are excluded from partitioning.
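A spec like "0,3-5,9" expands to the set {0, 3, 4, 5, 9}. A minimal parser sketch follows; the shipped parse_step_list may differ in details such as validation and error handling.

def parse_step_list(spec: str) -> set[int]:
    """Expand a spec like "0,3-5,9" into {0, 3, 4, 5, 9} (illustrative sketch)."""
    allowed: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)      # a range such as "3-5"
            allowed.update(range(int(lo), int(hi) + 1))
        else:
            allowed.add(int(part))           # a single step number
    return allowed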
| Field | Type | Set By | Description |
|---|---|---|---|
| plan_dir | Path | CLI init | Directory containing rendered plan files |
| workspace | Path | CLI init | Working directory for this run |
| base_dir | Path \| None | CLI init | Project root (for templates, scenarios) |
| specification_file | str \| None | CLI init | Resolved --spec path |
| rendered_stacks | list[Path] | Plan rendering | Paths to rendered stack directories |
| dry_run | bool | CLI flag | If True, commands are logged but not executed |
| verbose | bool | CLI flag | Enable verbose output |
| non_admin | bool | CLI flag | Skip steps requiring cluster-admin |
| current_phase | Phase | Phase runner | Currently executing phase |
| deep_clean | bool | CLI flag | Teardown: wipe all resources |
| release | str | Default "llmdbench" | Helm release name prefix |
| cluster_url | str \| None | Step 00 | Kubernetes API server URL |
| cluster_token | str \| None | Step 00 | Bearer token for API server auth |
| kubeconfig | str \| None | CLI / env | Path to kubeconfig |
| is_openshift | bool | Step 00 | True if cluster is OpenShift |
| is_kind | bool | Step 00 | True if cluster is Kind |
| is_minikube | bool | Step 00 | True if cluster is Minikube |
| cluster_name | str \| None | Step 00 | Hostname from API server URL |
| cluster_server | str \| None | Step 00 | Full API server URL |
| context_name | str \| None | Step 00 | Kube context name |
| username | str \| None | Step 00 | Current user for labeling |
| namespace | str \| None | Plan config / CLI | Model deployment namespace |
| harness_namespace | str \| None | Plan config / CLI | Benchmark harness namespace |
| wva_namespace | str \| None | Plan config / CLI | WVA namespace |
| proxy_uid | int \| None | Step 00 | OpenShift UID range (first_uid + 1) |
| model_name | str \| None | Plan config | Model identifier (e.g. meta-llama/Llama-3.1-8B) |
| accelerator_resource | str \| None | Step 03 | Node accelerator resource (e.g. nvidia.com/gpu) |
| network_resource | str \| None | Step 03 | Network resource (e.g. rdma/rdma_shared_device_a) |
| deployed_endpoints | dict[str, str] | Standup | Endpoint URLs keyed by stack name |
| deployed_methods | list[str] | Standup | Deploy methods used (standalone, modelservice) |
| deployed_pod_names | list[str] | Run step 06 | Harness pod names deployed |
| experiment_ids | list[str] | Run step 06 | Experiment IDs for result collection |
| experiment_treatments | list[dict] \| None | Run phase | Treatment overrides from experiment YAML |
| results_dir | Path \| None | Run phase | Where results are written |
| harness_name | str \| None | Run phase | Active harness (inference-perf, guidellm, etc.) |
| harness_profile | str \| None | Run phase | Workload profile filename |
| experiment_treatments_file | str \| None | Run phase | Path to experiment treatments file |
| profile_overrides | str \| None | Run phase | Profile override string |
| harness_output | str | Default "local" | Output destination (local, gs://, s3://) |
| harness_parallelism | int | Default 1 | Parallel harness pods per treatment |
| harness_wait_timeout | int | Default 3600 | Seconds to wait for harness completion |
| harness_debug | bool | Default False | Enable debug mode for harness pods |
| harness_skip_run | bool | Default False | Skip harness run (collect-only mode) |
| harness_service_account | str \| None | Run phase | Service account for harness pods |
| harness_envvars_to_pod | str \| None | Run phase | Extra env vars forwarded to harness pods |
| analyze_locally | bool | Default False | Run analysis locally instead of in-cluster |
| endpoint_url | str \| None | CLI flag | Existing endpoint URL (run-only mode) |
| run_config_file | str \| None | CLI flag | Pre-built run config file (run-only mode) |
| generate_config_only | bool | Default False | Generate config without executing run |
| dataset_url | str \| None | CLI flag | Custom dataset URL for harness |
| kubectl_cmd | str | Default "kubectl" | Path to kubectl binary |
| helm_cmd | str | Default "helm" | Path to helm binary |
| helmfile_cmd | str | Default "helmfile" | Path to helmfile binary |
| python_cmd | str | Default "python3" | Path to python binary |
| logger | LoggerProtocol | CLI init | Logger instance |
| cmd | CommandExecutor | rebuild_cmd() | Shared command executor for kubectl/helm/shell |
Use context.require_cmd() to get the CommandExecutor. It raises
RuntimeError if not yet initialized:
cmd = context.require_cmd()
# Run kubectl commands
result = cmd.kube("get", "pods", "--namespace", ns, check=False)
# Run arbitrary shell commands
result = cmd.execute("aws s3 ls s3://bucket", check=False, silent=True)
Steps can load the merged YAML config from rendered stacks using helper methods
inherited from Step:
# Load config from the first rendered stack
plan_config = self._load_plan_config(context)
# Load config from a specific stack (for per_stack steps)
stack_config = self._load_stack_config(stack_path)
# Require a nested config key (raises KeyError if missing)
model_name = self._require_config(plan_config, "model", "name")
# Resolve with three-tier fallback: context value > plan config > default
port = self._resolve(plan_config, "vllmCommon.inferencePort",
context_value=None, default=8000)
Use the logger attached to context:
context.logger.log_info("Starting deployment...")
context.logger.log_warning("PVC size mismatch, continuing")
context.logger.log_error("Deployment failed: timeout")
context.logger.set_indent(1)  # Indent subsequent messages
The LoggerProtocol (in llmdbenchmark/executor/protocols.py) defines the
interface: log_info, log_warning, log_error, set_indent.
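Any object exposing those four methods satisfies the protocol, which makes it easy to substitute a simple logger when unit-testing a step. Below is a sketch: the Protocol body paraphrases the interface described above, and PrintLogger is a hypothetical test helper, not part of the framework.

from typing import Protocol

class LoggerProtocol(Protocol):
    """Shape of the logger interface described above (sketch; see protocols.py for the real one)."""
    def log_info(self, message: str) -> None: ...
    def log_warning(self, message: str) -> None: ...
    def log_error(self, message: str) -> None: ...
    def set_indent(self, level: int) -> None: ...

class PrintLogger:
    """Minimal stand-in logger useful when exercising a step outside the CLI."""
    def __init__(self) -> None:
        self._indent = 0
    def log_info(self, message: str) -> None:
        print("  " * self._indent + f"INFO: {message}")
    def log_warning(self, message: str) -> None:
        print("  " * self._indent + f"WARN: {message}")
    def log_error(self, message: str) -> None:
        print("  " * self._indent + f"ERROR: {message}")
    def set_indent(self, level: int) -> None:
        self._indent = level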
Steps communicate through ExecutionContext fields. For example:
- Standup step 06 (deploy) writes to context.deployed_methods and context.deployed_endpoints
- Smoketest phase reads context.deployed_methods to determine the pod selector
- Run step 06 (deploy harness) writes to context.experiment_ids and context.deployed_pod_names
- Run step 08 (collect results) reads context.experiment_ids to know which results to fetch
All analysis code is in llmdbenchmark/analysis/:
llmdbenchmark/analysis/
__init__.py # Main run_analysis() entry point
cross_treatment.py # Cross-treatment comparison (CSV + plots)
per_request_plots.py # Per-request distribution plots
visualize_metrics.py # Prometheus metrics visualization
benchmark_report/ # Benchmark report format converters
native_to_br0_1.py # Convert harness output to BR v0.1
native_to_br0_2.py # Convert harness output to BR v0.2
...
The analysis pipeline in __init__.py calls plotting functions in sequence.
To add a new plot:
- Create a new module, e.g., llmdbenchmark/analysis/my_custom_plots.py:
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from llmdbenchmark.executor.context import ExecutionContext
def generate_my_plots(
results_dir: Path,
output_dir: Path | None = None,
context: "ExecutionContext | None" = None,
) -> int:
"""Generate custom plots. Returns the number of plots generated."""
try:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
except ImportError:
return 0
if output_dir is None:
output_dir = results_dir / "analysis" / "custom"
output_dir.mkdir(parents=True, exist_ok=True)
# Load data, generate plots, save PNGs...
# Return the count of generated plots
return 0
- Call it from run_analysis() in llmdbenchmark/analysis/__init__.py. Add a new section after the existing plot generation steps:
# --- 6. Generate custom plots ---
from llmdbenchmark.analysis.my_custom_plots import generate_my_plots
try:
count = generate_my_plots(results_dir, context=context)
if count:
_log(context, f"Generated {count} custom plot(s)")
except Exception as exc:
_log(context, f"Custom plot generation failed: {exc}", warning=True)
Edit METRICS_OF_INTEREST in llmdbenchmark/analysis/cross_treatment.py. Each
entry is a tuple of (dotted_path_in_br_v0.2, csv_column_name):
METRICS_OF_INTEREST = [
# ... existing metrics ...
("results.request_performance.aggregate.latency.time_to_first_token.mean", "ttft_mean_s"),
# Add your new metric:
("results.custom_section.my_metric.value", "my_metric_value"),
]
To also generate a comparison bar chart for it, add an entry to the plot_specs
list in _generate_comparison_plots():
plot_specs = [
# ... existing specs ...
("my_metric_value", "My Custom Metric", "unit", True), # True = higher is better
]
The analysis pipeline runs in two stages:
- Per-treatment analysis (run_analysis() in __init__.py):
  - Converts raw harness output to benchmark report v0.1 and v0.2 YAML
  - Extracts summary from stdout.log
  - Runs harness-specific post-processing (e.g., inference-perf --analyze)
  - Generates metric time-series plots from Prometheus data
  - Generates per-request distribution plots (histograms, CDFs, scatter)
- Cross-treatment analysis (generate_cross_treatment_summary() in cross_treatment.py):
  - Reads benchmark report v0.2 files from all treatment subdirectories
  - Produces a CSV summary table (one row per treatment)
  - Generates comparison bar charts and scatter/line plots
  - Generates overlaid CDF plots comparing distributions across treatments
A harness is a load generator that sends requests to the model endpoint. Adding one requires files in three areas.
- Harness script in workload/harnesses/: Create my-harness-llm-d-benchmark.sh (or .py). This script is the entrypoint that the harness pod runs. It receives configuration through environment variables and the profile ConfigMap.
- Profile templates in workload/profiles/my-harness/: Create .yaml.in files that define workload configurations. Use REPLACE_ENV_* placeholders for values injected at runtime:
  # workload/profiles/my-harness/random_test.yaml.in
  executable: my-harness-binary
  model: REPLACE_ENV_LLMDBENCH_DEPLOY_CURRENT_MODEL
  base-url: REPLACE_ENV_LLMDBENCH_HARNESS_STACK_ENDPOINT_URL
  num-requests: 100
  concurrency: 8
- Analysis integration in llmdbenchmark/analysis/__init__.py: Register the harness in the module-level dictionaries:
  _RESULT_PATTERNS["my-harness"] = "*.json"          # Glob for result files
  _WRITER_NAMES["my-harness"] = "my-harness"         # benchmark_report writer name
  _SUMMARY_MARKERS["my-harness"] = "Results Summary" # Log marker for summary extraction
- Benchmark report converter (optional but recommended): Add conversion functions to llmdbenchmark/analysis/benchmark_report/ in both native_to_br0_1.py and native_to_br0_2.py so raw output can be normalized to the standard benchmark report format.
The harness pod template is at config/templates/jinja/20_harness_pod.yaml.j2.
It uses the harness.name config value to select the correct entrypoint script.
If your harness needs a custom container image, you may need to update the
Dockerfile as well.
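As a rough illustration of the entrypoint contract, a skeletal Python harness script might look like the sketch below. The environment variable names and file locations here are assumptions for illustration only; the real contract is defined by the harness pod template and the collect-results step.

#!/usr/bin/env python3
"""Skeletal harness entrypoint sketch. Env var names and paths are illustrative assumptions."""
import json
import os
import subprocess
import sys
from pathlib import Path

import yaml

def main() -> int:
    # Assumed locations: the rendered profile mounted from the profile ConfigMap,
    # and a results directory the collect-results step would read from.
    profile_path = Path(os.environ.get("LLMDBENCH_HARNESS_PROFILE_PATH", "/workspace/profile.yaml"))
    results_dir = Path(os.environ.get("LLMDBENCH_HARNESS_RESULTS_DIR", "/workspace/results"))
    results_dir.mkdir(parents=True, exist_ok=True)

    profile = yaml.safe_load(profile_path.read_text())
    # Build the load-generator command from the profile (keys mirror the template above)
    cmd = [
        profile["executable"],
        "--model", profile["model"],
        "--base-url", profile["base-url"],
        "--num-requests", str(profile["num-requests"]),
        "--concurrency", str(profile["concurrency"]),
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    (results_dir / "results.json").write_text(json.dumps({
        "returncode": proc.returncode,
        "stdout": proc.stdout,
    }))
    return proc.returncode

if __name__ == "__main__":
    sys.exit(main())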
Scenarios define a deployment configuration (model, GPU type, vLLM settings, etc.). They are the primary way users customize what gets deployed.
Each scenario requires two files:
- Specification template (.yaml.j2) in config/specification/: Points to the defaults file, template directory, and scenario file.
- Scenario defaults (.yaml) in config/scenarios/: Contains the actual configuration overrides.
Create config/specification/examples/my-scenario.yaml.j2:
# My Custom Scenario Specification
{% set base_dir = base_dir | default('../') -%}
base_dir: {{ base_dir }}
values_file:
path: {{ base_dir }}/config/templates/values/defaults.yaml
template_dir:
path: {{ base_dir }}/config/templates/jinja
scenario_file:
path: {{ base_dir }}/config/scenarios/examples/my-scenario.yaml
The specification template is a Jinja2 file rendered by RenderSpecification
(in llmdbenchmark/parser/render_specification.py). The base_dir variable is
injected automatically from the project root or --base-dir CLI argument.
Create config/scenarios/examples/my-scenario.yaml:
scenario:
- name: "my-custom-deployment"
model:
name: meta-llama/Llama-3.1-8B-Instruct
vllmCommon:
tensorParallelism: 2
flags:
disableLogRequests: true
noPrefixCaching: true
standalone:
enabled: true
replicas: 1
modelservice:
enabled: false
harness:
name: inference-perf
experimentProfile: sanity_random.yaml
The scenario key is a list. Each entry becomes a "stack" that gets deployed.
Most scenarios define a single stack. Parameters not specified here inherit from
config/templates/values/defaults.yaml.
The deployment method must be exactly one of standalone or modelservice.
Both standalone.enabled and modelservice.enabled must be explicitly set so
that the rendered templates know which path to take.
When a scenario defines more than one stack (e.g. to run multiple models
behind the same gateway), lift scenario-wide settings into the top-level
shared: block and keep per-stack blocks to just the model-specific knobs.
# config/scenarios/examples/my-two-model-scenario.yaml
shared:
modelservice: { enabled: true }
standalone: { enabled: false }
httpRoute:
mode: shared
name: multi-model-route
pathPrefix: /{stack.name}
rewriteTo: /
wva:
enabled: true
image: { tag: v0.6.0 }
scenario:
- name: qwen3-06b
model: { name: Qwen/Qwen3-0.6B, ... }
decode: { replicas: 1 }
wva:
variantAutoscaling: { minReplicas: 1, maxReplicas: 4 }
hpa: { minReplicas: 1, maxReplicas: 4 }
- name: llama-31-8b
model: { name: unsloth/Meta-Llama-3.1-8B, ... }
decode: { replicas: 1 }
wva:
variantAutoscaling: { minReplicas: 1, maxReplicas: 4 }
hpa: { minReplicas: 1, maxReplicas: 4 }
See the next subsection for how shared: merges with per-stack and the
render-time behavior (auto-named PVCs, shared HTTPRoute, stack-index guards).
For a fully-annotated real scenario, see
guides/multi-model-wva.yaml.
The scenario file supports an optional top-level shared: block alongside the
scenario: list. Values in shared: are merged into every stack before the
per-stack overrides, letting you lift scenario-wide settings (gateway, WVA
controller, shared HTTPRoute config, chart versions, plugin config) out of
per-stack blocks. Per-stack always wins, so a stack can still opt out of any
shared value by setting it explicitly.
# config/scenarios/guides/multi-model-wva.yaml (abridged)
shared:
modelservice: { enabled: true }
standalone: { enabled: false }
wva:
enabled: true
image: { repository: ghcr.io/llm-d/llm-d-workload-variant-autoscaler, tag: v0.6.0 }
httpRoute:
mode: shared
name: multi-model-route
pathPrefix: /{stack.name}
rewriteTo: /
vllmCommon:
flags: { enforceEager: true }
scenario:
- name: qwen3-06b
model: { name: Qwen/Qwen3-0.6B, ... }
decode: { replicas: 1, resources: { ... } }
wva:
variantAutoscaling: { minReplicas: 1, maxReplicas: 4 }
hpa: { minReplicas: 1, maxReplicas: 4 }
- name: llama-31-8b
model: { name: unsloth/Meta-Llama-3.1-8B, ... }
# same shape
Resource-name collision handling (multi-stack only). When two or more stacks
share a namespace, the render engine auto-suffixes a small set of shipped-default
resource names with the stack's model_id_label so Helm releases and
Kubernetes objects don't collide:
| Default path | Rewritten to (N >= 2) |
|---|---|
| downloadJob.name (download-model) | download-model-{model_id_label} |
| inferenceExtension.monitoring.secretName | ...-sa-metrics-reader-secret-{model_id_label} |
Explicit overrides (in defaults.yaml, shared:, or a stack) are preserved.
Single-stack scenarios skip the rewrite entirely.
Model PVCs are shared, not per-stack. storage.modelPvc.name deliberately
is not in the auto-suffix list - every stack writes its weights to a
distinct subdirectory on one shared PVC (keyed by model.path). That matches
how NVMe-backed and local-directory storage classes are typically deployed
(one volume with per-model subdirs, not N volumes), and avoids wasting
pre-provisioned disk. Operators who genuinely need per-model volumes can
override storage.modelPvc.name per-stack in the scenario.
See _resolve_per_stack_identity
for the implementation and _STACK_SCOPED_DEFAULTS for the full list.
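In spirit, the rewrite is a conditional suffixing pass along the lines of the sketch below. This is illustrative only: the real logic lives in _resolve_per_stack_identity, and the actual _STACK_SCOPED_DEFAULTS entries and data structures differ.

# Illustrative sketch of the collision-avoidance rename; not the real implementation.
_STACK_SCOPED_DEFAULTS = {
    ("downloadJob", "name"): "download-model",
    # ... the shipped list covers the paths in the table above ...
}

def suffix_stack_scoped_defaults(stack_config: dict, model_id_label: str, num_stacks: int) -> None:
    """Append -{model_id_label} to shipped-default names when 2+ stacks share a namespace."""
    if num_stacks < 2:
        return  # single-stack scenarios are left untouched
    for path, shipped_default in _STACK_SCOPED_DEFAULTS.items():
        node = stack_config
        for key in path[:-1]:
            node = node.get(key)
            if not isinstance(node, dict):
                break
        else:
            leaf = path[-1]
            # Explicit overrides (anything other than the shipped default) are preserved
            if node.get(leaf) == shipped_default:
                node[leaf] = f"{shipped_default}-{model_id_label}"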
Shared HTTPRoute template. When httpRoute.mode: shared, only the first
stack's render of 08_httproute.yaml.j2
emits a non-empty file - a single HTTPRoute with one backendRef per sibling
stack. Sibling stacks are exposed to templates via the injected
siblingStacks list and stackIndex variable (see
_build_sibling_stacks). The same
stackIndex > 1 -> empty gate dedupes cluster-shared infra templates -
09_helmfile-gateway-provider.yaml.j2
(istio control plane) and the infra-llmdbench release in
10_helmfile-main.yaml.j2 -
so N stacks don't race on the same Helm release during parallel per-stack step
execution.
- RenderSpecification renders the .yaml.j2 spec to resolve base_dir paths and writes the result as YAML.
- RenderPlans (in llmdbenchmark/parser/render_plans.py) loads the defaults YAML and the scenario YAML. For each stack it applies a four-layer merge: defaults.yaml -> scenario.shared -> stack config -> CLI / setup overrides. Each later layer wins over earlier ones; dicts deep-merge, lists replace wholesale (see the sketch after this list). After merging, RenderPlans resolves model IDs, per-stack identity names, and substitutes ${dotted.path} references, then renders each Jinja2 template in config/templates/jinja/ with the final values.
- Output goes to one directory per stack under the plan directory (e.g., plan/my-custom-deployment/, or plan/qwen3-06b/, plan/llama-31-8b/, ...).
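The merge semantics ("later layer wins; dicts deep-merge, lists replace wholesale") can be pictured with this small sketch; it is illustrative, not the RenderPlans implementation.

def deep_merge(base: dict, override: dict) -> dict:
    """Later layer wins; nested dicts merge recursively, lists and scalars replace wholesale."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Four-layer order applied per stack:
# final = deep_merge(deep_merge(deep_merge(defaults, shared), stack), cli_or_setup_overrides)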
To add infrastructure YAML that gets rendered per-stack, create a new .yaml.j2
file in config/templates/jinja/. Use a numeric prefix to control ordering:
config/templates/jinja/24_my-custom-resource.yaml.j2
Files prefixed with _ are treated as partials/macros and are not rendered
directly. All other .yaml.j2 files are rendered for each stack.
Templates receive the full merged config dictionary. Access values with Jinja2 syntax:
# config/templates/jinja/24_my-custom-resource.yaml.j2
{% if myFeature is defined and myFeature.enabled %}
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ release }}-my-config
namespace: {{ namespace.name }}
data:
setting: "{{ myFeature.setting | default('default-value') }}"
{% endif %}
Run the scenario with:
llmdbenchmark --spec my-scenario standup
llmdbenchmark --spec my-scenario run
llmdbenchmark --spec my-scenario teardown
The --spec flag resolves to the specification file by searching
config/specification/ directories for a matching name.
Smoketest validators run during the SMOKETEST phase to verify that a
scenario's rendered config matches expectations before the benchmark run begins.
Add a new file in llmdbenchmark/smoketests/validators/, e.g.,
my_scenario.py. The file name must match the stack directory name -- for
example, a stack named my-scenario maps to my_scenario.py.
Extend ScenarioValidator from llmdbenchmark/smoketests/validators/base.py
and implement the run_config_validation method:
from llmdbenchmark.smoketests.validators.base import ScenarioValidator
class MyScenarioValidator(ScenarioValidator):
"""Validate the my-scenario deployment configuration."""
def run_config_validation(self, config, cmd, namespace, logger):
# Validate that expected pods are running with the correct config.
# Checks come from the rendered config.yaml, not hardcoded values.
pods = self.get_pod_specs(cmd, namespace, label="app=my-scenario")
self.validate_role_pods(pods, expected_count=1, role="server")
self.assert_env_equals(pods[0], "MODEL_NAME", config["model"]["name"])
self.assert_arg_contains(pods[0], "--tensor-parallel-size",
str(config["vllmCommon"]["tensorParallelism"]))
The base class provides these helper assertions:
- validate_role_pods(pods, expected_count, role) -- Assert the expected number of pods exist for a given role.
- assert_env_equals(pod_spec, env_var, expected_value) -- Check that a container environment variable has the expected value.
- assert_arg_contains(pod_spec, arg_name, expected_value) -- Check that a container argument contains the expected value.
- get_pod_specs(cmd, namespace, label) -- Retrieve pod specs filtered by label selector.
Add an entry to VALIDATOR_REGISTRY in
llmdbenchmark/smoketests/validators/__init__.py:
from llmdbenchmark.smoketests.validators.my_scenario import MyScenarioValidator
VALIDATOR_REGISTRY = {
# ... existing validators ...
"my-scenario": MyScenarioValidator,
}
The registry key must match the stack directory name.
- All checks should derive from the rendered config.yaml, not hardcoded values. This ensures the validator stays in sync with scenario changes.
- See llmdbenchmark/smoketests/README.md for the full check system documentation, including the complete list of built-in assertions and advanced usage patterns.
Experiments define a Design of Experiments (DoE) matrix with setup treatments (infrastructure variations) and run treatments (workload parameter sweeps).
Create a file in workload/experiments/, e.g.,
workload/experiments/my-experiment.yaml:
experiment:
name: my-experiment
description: >
Measure throughput under increasing concurrency
with different tensor parallelism settings.
harness: vllm-benchmark
profile: random_concurrent.yaml
design:
type: factorial
setup:
factors:
- name: tensor_parallelism
key: vllmCommon.tensorParallelism
levels: [1, 2, 4]
run:
factors:
- name: max-concurrency
key: max-concurrency
levels: [1, 8, 32]
constants:
- key: num-prompts
value: 100
- key: dataset-name
value: random
# Runtime: setup treatments consumed by the experiment orchestrator.
# Each treatment specifies config overrides applied during plan rendering.
setup:
treatments:
- name: tp1
vllmCommon.tensorParallelism: 1
- name: tp2
vllmCommon.tensorParallelism: 2
- name: tp4
vllmCommon.tensorParallelism: 4
# Runtime: run treatments consumed by step_04 render_profiles.
# Each treatment specifies harness profile overrides.
treatments:
- name: conc1
max-concurrency: 1
num-prompts: 100
- name: conc8
max-concurrency: 8
num-prompts: 100
- name: conc32
max-concurrency: 32
num-prompts: 100
Setup treatments (setup.treatments) control the infrastructure deployed.
Each treatment triggers a full standup/run/teardown cycle. Override keys use
dotted paths into the scenario config (e.g., vllmCommon.tensorParallelism,
decode.replicas, standalone.enabled). These overrides are passed to
RenderPlans as setup_overrides and deep-merged into the scenario config
during template rendering.
Run treatments (treatments) control the workload parameters for each
benchmark run. Each treatment overrides profile values (e.g., max-concurrency,
num-prompts). Within a single setup treatment, all run treatments execute
sequentially against the same deployed stack.
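Conceptually, each run treatment is a shallow override of the rendered profile values, something like the sketch below. The dictionaries mirror the experiment YAML above; the merge expression itself is illustrative.

# Base values rendered from the profile template
base_profile = {"max-concurrency": 1, "num-prompts": 100, "dataset-name": "random"}
# One run treatment from the experiment YAML ("name" only identifies the treatment)
treatment = {"name": "conc8", "max-concurrency": 8, "num-prompts": 100}

rendered = {**base_profile, **{k: v for k, v in treatment.items() if k != "name"}}
# rendered == {"max-concurrency": 8, "num-prompts": 100, "dataset-name": "random"}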
Setup treatment keys map directly to the scenario YAML structure. For example:
setup:
treatments:
- name: pd-2decode
decode.replicas: 2 # Sets scenario[].decode.replicas
prefill.replicas: 4 # Sets scenario[].prefill.replicas
standalone.enabled: false # Disables standalone mode
Run the experiment with:
llmdbenchmark --spec my-scenario experiment \
--experiments workload/experiments/my-experiment.yaml
The experiment orchestrator (_execute_experiment in llmdbenchmark/cli.py):
- Parses the experiment YAML via parse_experiment()
- Falls back experiment-level harness and profile to CLI args if not provided
- For each setup treatment:
  - Renders plans with the treatment's config overrides
  - Runs standup
  - Runs all run treatments (each as a separate run phase)
  - Runs teardown (unless --skip-teardown)
- Records success/failure per treatment in ExperimentSummary
- Writes experiment-summary.yaml to the workspace
Options:
- --stop-on-error: Abort on first failure (default: continue to next treatment)
- --skip-teardown: Leave stacks running for debugging