Upcycled Language Models

Published On: 11/14/24, 14:24

Author: Julian Bleecker

Contributor:

Tags

AI, MAGAZINE, DESIGN FICTION

Reference URLs

https://arxiv.org/abs/2410.07524

Type

CLASSIFIED AD

Title

Upcycled Language Models

Subtitle

Explainer

This is the abstract from the paper "Upcycling Large Language Models into Mixture of Experts":

Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel "virtual group" initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.
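For readers curious about the two routing orders the abstract names, here is a minimal sketch of the difference. It is not the paper's code: the function names and the example logits are made up for illustration, and it assumes a router that emits one raw score (logit) per expert for a single token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores, k):
    """Indices of the k largest scores, largest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def softmax_then_topk(logits, k):
    """Normalize over ALL experts, then keep the k largest probabilities.

    The kept weights are not renormalized, so they sum to less than 1
    whenever k is smaller than the number of experts.
    """
    probs = softmax(logits)
    return {i: probs[i] for i in top_k_indices(probs, k)}


def topk_then_softmax(logits, k):
    """Pick the k largest logits first, then normalize over only those k.

    The kept weights always sum to 1; with k = 1 the single weight is 1.0
    regardless of how confident the router was.
    """
    chosen = top_k_indices(logits, k)
    weights = softmax([logits[i] for i in chosen])
    return dict(zip(chosen, weights))


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical router scores, 4 experts
    print(softmax_then_topk(example_logits, 2))
    print(topk_then_softmax(example_logits, 2))
```

Both orders select the same experts; they differ only in the weights given to those experts. In this sketch, softmax-then-topK keeps information about how much probability mass went to the unselected experts, while topK-then-softmax discards it.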

Descriptive Text

See 'em to believe 'em! I've got a baker's dozen of these great home-built & surplus upcycled pre-trained high- and medium-density language models that I made to be suitable for sparse Mixture of Experts mode. Great for general-purpose agents, domestic appliances, small scale civic reasoning systems, municipal infrastructures, etc. Up to 15B parameters, 1T token windows in many of these. Baseline upcycled Nemotron-5 that achieved 67.6% MMLU. Most are ablated and can even outperform dense model training. Ready to take possession immediately via any local entanglement node or public block address. Contact Edgar at 0xED098ef for more specs and evaluation keys.
