This note defines what Plato calls faithful DiLoCo in the current
implementation.
Faithful DiLoCo in Plato means algorithm-faithful execution of the DiLoCo
training loop inside Plato's federated runtime. It does not mean reproducing
the paper's exact C4 dataset, model scale, tokenizer, hardware topology,
pretraining duration, or final benchmark numbers.
## Example Configurations
Plato includes MNIST/LeNet and CIFAR-10/ResNet-18 comparison configurations
for checking DiLoCo against matched FedAvg runs. These examples validate
Plato's DiLoCo mechanics without reproducing the C4 dataset, tokenizer,
language-model scale, hardware topology, pretraining duration, or final
benchmark numbers from the paper.
## Algorithm Contract
DiLoCo has two optimizer levels:

- The client-local inner optimizer trains each selected logical client for
  exactly H local optimizer steps between synchronizations.
- The server-side outer optimizer updates the global model from the averaged
  outer gradient.
The DiLoCo server must still return a Plato-compatible model delta, because
`algorithm.update_weights()` adds the returned delta to the current global
model. For example, outer SGD with learning rate 1.0 and no momentum passes
the averaged Plato delta through unchanged, so it is equivalent to FedAvg only
when both runs use the same averaging rule.
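As a concrete reference, here is a minimal PyTorch sketch of the server-side
outer step. The names (`global_model`, `averaged_params`, `outer_step`) are
illustrative assumptions, not Plato's actual API; the sketch shows the sign of
the outer gradient and why plain outer SGD with learning rate 1.0 reduces to
FedAvg.

```python
import torch
from torch import nn

# Illustrative stand-ins: a tiny global model, and averaged_params standing
# in for the averaged client weights produced by the selected averaging rule.
global_model = nn.Linear(4, 2)
averaged_params = [p.detach().clone() + 0.01 for p in global_model.parameters()]

def outer_step(model, averaged, outer_opt):
    """One server-side outer step: outer gradient = global - averaged."""
    with torch.no_grad():
        for p, p_avg in zip(model.parameters(), averaged):
            # The outer gradient is the negated averaged Plato delta.
            p.grad = p.detach() - p_avg
    outer_opt.step()
    outer_opt.zero_grad()

# Plain SGD with lr=1.0 gives p <- p - (p - p_avg) = p_avg, i.e. exactly
# FedAvg under the same averaging rule.
fedavg_equivalent = torch.optim.SGD(global_model.parameters(), lr=1.0)

# The faithful DiLoCo default is Nesterov momentum SGD on the outer level.
diloco_outer = torch.optim.SGD(
    global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True
)
outer_step(global_model, averaged_params, diloco_outer)
```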
The outer optimizer runs on the server. Clients run only the inner optimizer
and send model weights or weight-equivalent updates. Client-local optimizer and
scheduler state persists per logical client and is never sent to the server.
## Local Work H
H means client-local optimizer steps between synchronizations. It is not:

- epochs,
- raw dataloader batches, or
- gradient-accumulation micro-batches.
When gradient accumulation is enabled, H counts completed optimizer steps.
Raw batches that do not trigger optimizer.step() do not increment H.
H may be smaller than one epoch. Faithful DiLoCo must therefore stop local
training mid-epoch after exactly H optimizer steps. This early stop must
still run normal trainer cleanup, state persistence, callback completion, and
reporting paths. It must not perform an extra final optimizer step.
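A minimal sketch of step-exact local training under these rules, assuming
PyTorch; the function and its arguments (`train_local_round`, `h_steps`,
`accum_steps`) are illustrative, not Plato's trainer interface.

```python
import torch

def train_local_round(model, loader, optimizer, h_steps, accum_steps=1):
    """Run exactly h_steps optimizer steps, stopping mid-epoch if needed."""
    completed_steps = 0
    optimizer.zero_grad()
    while completed_steps < h_steps:  # h_steps may span more than one epoch
        for batch_idx, (inputs, targets) in enumerate(loader):
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            (loss / accum_steps).backward()
            # Only a completed optimizer step increments H; micro-batches
            # that merely accumulate gradients do not.
            if (batch_idx + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
                completed_steps += 1
                if completed_steps == h_steps:
                    # Stop mid-epoch after exactly H steps; the caller then
                    # runs normal cleanup, persistence, and reporting. No
                    # extra final optimizer step is taken.
                    return completed_steps
    return completed_steps
```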
Small-H training must not replay the same first H batches every round simply
because the train loader is recreated each round. The implementation must use
round-aware resampling or an equivalent persistent sampling stream so that
each logical client's local data stream advances across rounds in a
reproducible way.
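One way to satisfy this, sketched below under the assumption of a PyTorch
dataloader, is to seed each round's sampler from the (client, round) pair; the
helper name and seeding scheme are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_round_loader(dataset, client_id, round_id, batch_size=32):
    """Build a loader whose shuffle order is a deterministic function of
    (client_id, round_id), so each round sees a fresh, reproducible slice
    of the client's stream instead of replaying the first H batches."""
    generator = torch.Generator()
    # Illustrative seed mix; any stable (client, round) hash works.
    generator.manual_seed(hash((client_id, round_id)) % (2**63))
    sampler = torch.utils.data.RandomSampler(dataset, generator=generator)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# Usage: rounds 0 and 1 yield different but reproducible batch orders.
dataset = TensorDataset(torch.randn(256, 4), torch.randint(0, 2, (256,)))
loader_r0 = make_round_loader(dataset, client_id=7, round_id=0)
loader_r1 = make_round_loader(dataset, client_id=7, round_id=1)
```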
## State Ownership
Server-owned state:

- the global model,
- outer optimizer momentum or other outer optimizer state,
- aggregation metadata needed to update the global model.

Client-owned state:

- inner optimizer state, such as AdamW first and second moments,
- scheduler state and global/local optimizer-step counters,
- sampler or dataloader stream position needed for small-H continuity.
Client-owned optimizer and scheduler state must not appear in client-server
payloads. It must remain local to the logical client, including when training
uses subprocesses.
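A minimal sketch of local persistence, assuming PyTorch and a hypothetical
on-disk layout (`state_dir/client_{id}.pt`); the helpers are illustrative and
exist only to show that this state stays on the client side and never enters
the payload.

```python
import os
import torch

def save_client_state(state_dir, client_id, optimizer, scheduler, step_counter):
    """Persist client-owned state locally; it is never sent to the server."""
    torch.save(
        {
            "optimizer": optimizer.state_dict(),  # e.g. AdamW moments
            "scheduler": scheduler.state_dict(),
            "step_counter": step_counter,
        },
        os.path.join(state_dir, f"client_{client_id}.pt"),
    )

def load_client_state(state_dir, client_id, optimizer, scheduler):
    """Restore client-owned state when this logical client is next selected."""
    path = os.path.join(state_dir, f"client_{client_id}.pt")
    if not os.path.exists(path):
        return 0  # First round for this logical client.
    state = torch.load(path)
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step_counter"]
```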
## Parameter And Buffer Policy
By default, the outer optimizer applies only to trainable floating parameters.
This matches the algorithm definition, which optimizes model parameters.
Floating buffers, such as batch normalization running statistics, are
synchronized without outer momentum by default. They use the selected averaging
rule but do not receive server-side momentum or Nesterov treatment.
Non-floating buffers use conservative FedAvg-style behavior, including casting
or rounding as needed to preserve the buffer's dtype-compatible semantics.
The implementation may offer `apply_outer_optimizer_to = "all_floating"` for
experiments, but the default must remain `"parameters"`.
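A minimal PyTorch sketch of the eligibility partition; the function name and
return grouping are illustrative.

```python
import torch
from torch import nn

def partition_state(model):
    """Split model state into the three policy groups: trainable floating
    parameters (outer optimizer), floating buffers (averaged, no momentum),
    and non-floating buffers (conservative FedAvg-style handling)."""
    outer_params, floating_buffers, other_buffers = {}, {}, {}
    for name, param in model.named_parameters():
        if param.requires_grad and param.is_floating_point():
            outer_params[name] = param
    for name, buf in model.named_buffers():
        if buf.is_floating_point():
            floating_buffers[name] = buf  # e.g. BatchNorm running_mean/var
        else:
            other_buffers[name] = buf     # e.g. num_batches_tracked (int64)
    return outer_params, floating_buffers, other_buffers

# Usage: BatchNorm layers contribute both floating and integer buffers.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
params, float_bufs, int_bufs = partition_state(model)
```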
## Configuration Contract
The faithful initial mode uses these configuration names and defaults:
```toml
[server]
type = "diloco"

[algorithm]
type = "fedavg"

[trainer]
local_steps_per_round = H
preserve_optimizer_state = true
optimizer = "AdamW"

[server.diloco]
outer_optimizer = "nesterov"
outer_learning_rate = 0.7
outer_momentum = 0.9
aggregation_weighting = "uniform"        # or "num_samples"
apply_outer_optimizer_to = "parameters"  # or "all_floating"
```
`algorithm.type = "fedavg"` is intentional. Plato should reuse the existing
FedAvg weight extraction, delta computation, and global model loading path,
while `server.type = "diloco"` selects the server-side DiLoCo aggregation and
outer optimizer behavior.
`aggregation_weighting = "uniform"` matches the balanced worker setting most
closely. `aggregation_weighting = "num_samples"` matches Plato's traditional
sample-weighted FedAvg behavior. FedAvg equivalence for outer SGD with
learning rate 1.0 holds only when both runs use the same weighting rule.
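For concreteness, the two weighting rules over hypothetical per-client sample
counts:

```python
# Hypothetical sample counts reported by three selected clients.
sample_counts = [1200, 300, 500]

uniform_weights = [1.0 / len(sample_counts)] * len(sample_counts)
total = sum(sample_counts)
num_sample_weights = [n / total for n in sample_counts]

# uniform_weights     -> [0.333..., 0.333..., 0.333...]
# num_sample_weights  -> [0.6, 0.15, 0.25]
```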
Unsupported modes must fail clearly. They must not silently fall back to an
approximate DiLoCo variant. Examples include:

- trainer backends that cannot count local optimizer steps exactly,
- execution paths that cannot preserve client-local optimizer and scheduler
  state,
- samplers that cannot advance the small-H local data stream across rounds, or
- payload paths that would send optimizer state to the server.

Experimental combinations that are allowed but not faithful must warn clearly.
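A fail-fast sketch of this rule; the capability flags and messages are
illustrative assumptions, not Plato's actual validation hooks.

```python
def validate_faithful_mode(trainer, sampler, payload):
    """Refuse to run, rather than silently approximate faithful DiLoCo."""
    if not trainer.counts_optimizer_steps_exactly:
        raise ValueError("Faithful DiLoCo requires exact optimizer-step counting.")
    if not trainer.preserves_client_optimizer_state:
        raise ValueError("Faithful DiLoCo requires persistent client optimizer state.")
    if not sampler.supports_round_aware_resampling:
        raise ValueError("Faithful DiLoCo requires a round-aware sampling stream.")
    if payload.contains_optimizer_state:
        raise ValueError("Client optimizer state must never be sent to the server.")
```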
## Implementation Tasks

```yaml
- id: D1
  depends_on: []
  task: Document the exact DiLoCo contract and unsupported modes.
- id: D2
  depends_on: [D1]
  task: Add red tests for server-side outer gradient sign, weighting, and FedAvg equivalence under matching weighting.
- id: D3
  depends_on: [D2]
  task: Implement DiLoCo server aggregation and outer optimizer state for SGD, momentum SGD, and Nesterov.
- id: D4
  depends_on: [D1]
  task: Add red tests for exact local optimizer-step counting and `H` smaller than one epoch.
- id: D5
  depends_on: [D4]
  task: Implement `trainer.local_steps_per_round` with mid-epoch termination after exactly `H` optimizer steps.
- id: D6
  depends_on: [D1]
  task: Add red tests for per-client optimizer and scheduler state persistence.
- id: D7
  depends_on: [D6]
  task: Persist client-local optimizer and scheduler state without sending it to the server.
- id: D8
  depends_on: [D1]
  task: Add red tests for round-aware small-`H` sampling.
- id: D9
  depends_on: [D8]
  task: Implement round-aware resampling or an equivalent persistent sampling stream for each logical client.
- id: D10
  depends_on: [D1]
  task: Add red tests for parameter and buffer eligibility.
- id: D11
  depends_on: [D10]
  task: Implement the default trainable-parameter-only outer optimizer policy and conservative buffer synchronization.
- id: D12
  depends_on: [D3, D5, D7, D9, D11]
  task: Wire exact DiLoCo configuration, examples, and user-facing documentation.
- id: D13
  depends_on: [D12]
  task: Add end-to-end faithful-mode validation coverage.
```
Every implementation task should use red/green test-driven development. Add
the failing tests that describe the contract first, then implement the smallest
runtime change that makes those tests pass.