DynaMo


 DynaMo: Why Predict Just One Token at a Time?




DynaMo is a suite of multi-token prediction models. Our models are instantiated from the Pythia weights and generate multiple tokens in parallel using independent token heads.

DynaMo Multi-token Model
Multi-token prediction in DynaMo. (a) Traditional autoregressive prediction (one token at a time) requires three forward passes. (b) Non-autoregressive multi-token prediction requires only one forward pass.

Training

We train with a modified causal language modeling (CLM) objective that trains each token head independently using the following loss function:

\[\mathcal{L}_{\text{T}n} = - \frac{1}{N} \sum_{j=1}^N \sum_{t=1}^{L - n + 1} \log p(\mathbf{x}_{t+n}^j|\mathbf{x}_{1:t}^j)\]

for the nth token head. Here, N is the number of sequences in the training set, and L is the length of the jth sequence.
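
As a sketch, each head's loss reduces to standard cross-entropy on targets shifted n positions ahead. The function and tensor names below are illustrative assumptions, not the released training code:

```python
import torch.nn.functional as F

def token_head_losses(logits_per_head, input_ids):
    """Illustrative per-head CLM losses; head n predicts x_{t+n} from x_{1:t}.

    logits_per_head: list whose n-th entry (1-indexed) has shape (B, L, V)
    input_ids: (B, L) token ids
    """
    losses = []
    for n, logits in enumerate(logits_per_head, start=1):
        preds = logits[:, :-n, :]   # positions with a target n steps ahead
        targets = input_ids[:, n:]  # ground-truth tokens x_{t+n}
        losses.append(F.cross_entropy(
            preds.reshape(-1, preds.size(-1)), targets.reshape(-1)))
    return losses  # each head is trained independently, as in the loss above
```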

Multi-token Generation

DynaMo Text Generation
Flowchart of the DynaMo text generation pipeline.

We propose three novel techniques to perform multi-token generation with the DynaMo models (a simplified sketch of how they compose follows the list):

  • Co-occurrence weighted masking: The estimated joint probability distribution assumes the token heads are independent; we apply co-occurrence weighted masking to filter out improbable token combinations.
  • Adaptive thresholding: We implement adaptive thresholding of the estimated joint probability distribution to further boost performance.
  • Dynamic back-off: We back off to lower-order n-gram prediction when all the probabilities in the predicted joint probability distribution fall below a threshold.
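
Below is a minimal PyTorch sketch of how these three techniques might compose for a two-token-head model. The function name, the fixed eps threshold (standing in for the adaptive rule), and the precomputed co-occurrence weight matrix are illustrative assumptions, not the released implementation:

```python
import torch

def sample_two_tokens(p1, p2, cooc_weights, eps=0.02):
    """Illustrative two-head DynaMo-style sampling (not the official code).

    p1, p2: first- and second-token head distributions, each of shape (V,)
    cooc_weights: (V, V) co-occurrence weights for token pairs
        (assumed precomputed from a corpus)
    """
    # Joint distribution under the independence assumption.
    joint = p1[:, None] * p2[None, :]  # (V, V)
    # Co-occurrence weighted masking: suppress improbable pairs.
    joint = joint * cooc_weights
    # Dynamic back-off: if every joint probability is below the
    # threshold, fall back to ordinary one-token prediction.
    if joint.max() < eps:
        return [torch.multinomial(p1, 1).item()]
    # Thresholding: zero out low-probability pairs, then renormalize.
    # A fixed eps stands in for the adaptive thresholding rule here.
    joint = torch.where(joint < eps, torch.zeros_like(joint), joint)
    joint = joint / joint.sum()
    idx = torch.multinomial(joint.flatten(), 1).item()
    vocab = p2.numel()
    return [idx // vocab, idx % vocab]
```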

We show comparisons with other text generation methods below.

DynaMo Comparisons
Comparisons with other approaches. *The skeleton-of-thought decoding work evaluates models of different sizes.

Instruction Finetuning

We finetune our models on a filtered version of the Alpaca dataset to obtain chat versions of our multi-token models.
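
For illustration, a standard Alpaca-style record (fields: instruction, input, output) can be formatted into a supervised finetuning prompt as sketched below; the exact template and filtering criteria we use are not specified here:

```python
def format_alpaca_example(example):
    """Hypothetical prompt template for an Alpaca-style record;
    not necessarily the template used for DynaMo chat finetuning."""
    if example.get("input"):
        prompt = (f"### Instruction:\n{example['instruction']}\n\n"
                  f"### Input:\n{example['input']}\n\n### Response:\n")
    else:
        prompt = (f"### Instruction:\n{example['instruction']}\n\n"
                  f"### Response:\n")
    return prompt, example["output"]  # model is trained to produce the output
```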

DynaMo-7.3B-T3-Chat vs. Pythia-6.9B
DynaMo on Pythia-6.9B.

Developer

Shikhar Tuli. For any questions, comments, or suggestions, please reach out to me at stuli@princeton.edu.

License

We are working with Samsung’s legal team to publicly release the source code and model weights. We also plan to release the paper on arXiv soon. Stay tuned!