
Spirit LM: Meta AI’s Multimodal Model for Seamless Text and Speech Generation

October 22, 2024
5 mins

Overview of SPIRIT LM

SPIRIT LM, or SPoken and WRitten Interleaved Transformer Language Model, is Meta AI’s latest venture into building a multimodal foundation model. Traditionally, large language models (LLMs) like GPT-3 and LLaMA have excelled in text-based tasks, and recent advancements in SpeechLMs have pushed the boundaries of speech understanding. SPIRIT LM aims to merge these two modalities—speech and text—into a single, coherent system capable of cross-modal generation and comprehension.

Figure: Overview of SPIRIT LM (source: Spirit LM: Interleaved Spoken and Written Language)

This model stands out in its ability to freely mix text and speech, meaning it can handle tasks such as automatic speech recognition (ASR), text-to-speech (TTS), speech classification, and even expressive speech generation, all within a unified framework. SPIRIT LM offers two versions:

  • SPIRIT LM BASE: uses phonetic units for speech modeling.
  • SPIRIT LM EXPRESSIVE: models not only the phonetics but also the pitch and style of speech, enabling it to capture the expressive nuances of spoken language.

Figure: Model card of SPIRIT LM BASE and SPIRIT LM EXPRESSIVE

Motivation Behind SPIRIT LM

Before SPIRIT LM, generating speech required piecing together multiple models in a pipeline: an ASR model to transcribe speech, a language model to generate text, and a TTS model to convert text back into speech. This fragmented approach, while functional, often resulted in loss of expressiveness and semantic alignment between text and speech. SPIRIT LM addresses this by training a single model that can understand, process, and generate language in both modalities without needing a separate pipeline.

The motivation is simple yet profound: combining the semantic depth of text-based LLMs with the expressiveness of SpeechLMs. By achieving this, SPIRIT LM is set to revolutionize human-AI interaction by enabling more natural and fluent multimodal conversations.

Key Features of SPIRIT LM

Interleaving Text and Speech Data

The core innovation of SPIRIT LM lies in its ability to process interleaved text and speech. The model is trained using a unique word-level interleaving technique, where text and speech tokens are combined into a single stream. This allows the model to learn from speech-text aligned datasets and develop a deeper understanding of how written and spoken language relate to each other.

During training, speech is encoded using HuBERT phonetic tokens (a self-supervised speech representation), while text is encoded using subword Byte Pair Encoding (BPE) tokens. This combination allows SPIRIT LM to switch fluidly between generating text and speech tokens in response to input prompts.

This differs from traditional annotation pipelines, where text and speech are processed separately (e.g., text is labeled while speech is transcribed or tagged). SPIRIT LM skips this explicit separation and instead works with data in its raw or minimally processed form, learning from the combined stream.
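To make the idea concrete, here is a minimal sketch of what word-level interleaving could look like. It assumes word-aligned speech/text data is already available; the modality markers, helper names, and unit IDs are illustrative placeholders, not Meta AI's actual training code.

```python
import random

# Illustrative word-aligned data: each entry pairs a written word with the
# HuBERT unit IDs covering that word's audio span (IDs are made up).
aligned_words = [
    {"word": "hello", "hubert_units": [71, 12, 12, 503]},
    {"word": "world", "hubert_units": [9, 233, 87]},
    {"word": "again", "hubert_units": [41, 5, 300, 300]},
]

def bpe_tokenize(word):
    # Stand-in for a real BPE tokenizer (e.g., the LLaMA 2 tokenizer).
    return [f"bpe:{word}"]

def build_interleaved_stream(words, p_switch=0.3, seed=0):
    """Randomly switch modality at word boundaries, inserting a [TEXT] or
    [SPEECH] marker whenever the modality changes."""
    rng = random.Random(seed)
    modality = rng.choice(["text", "speech"])
    stream, previous = [], None
    for w in words:
        if rng.random() < p_switch:
            modality = "speech" if modality == "text" else "text"
        if modality != previous:
            stream.append("[TEXT]" if modality == "text" else "[SPEECH]")
            previous = modality
        if modality == "text":
            stream.extend(bpe_tokenize(w["word"]))
        else:
            stream.extend(f"hu:{u}" for u in w["hubert_units"])
    return stream

print(build_interleaved_stream(aligned_words))
```

Sequences like this are fed to the language model as a single stream, so one next-token objective covers text spans, speech spans, and the transitions between them.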

Few-Shot Learning Across Modalities

SPIRIT LM demonstrates few-shot learning capabilities not only in text-based tasks but also in cross-modal scenarios. For example, the model can perform ASR or TTS with just a few examples, allowing it to quickly adapt to new tasks without needing large-scale fine-tuning. This flexibility is essential for tasks like speech classification or sentiment analysis, where the model needs to understand and generate responses in both text and speech formats.
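For example, a few-shot ASR prompt can be assembled by concatenating a handful of (speech, transcript) pairs and ending with the query utterance in speech, leaving the model to continue in text. The sketch below is a hypothetical prompt builder; the marker tokens and formatting are assumptions for illustration, not the exact prompt format used in the paper.

```python
def make_asr_prompt(examples, query_units):
    """Build a few-shot ASR prompt; each example is (hubert_units, transcript)."""
    parts = []
    for units, transcript in examples:
        parts.append("[SPEECH] " + " ".join(f"hu:{u}" for u in units))
        parts.append("[TEXT] " + transcript)
    parts.append("[SPEECH] " + " ".join(f"hu:{u}" for u in query_units))
    parts.append("[TEXT]")  # cue the model to answer in the text modality
    return "\n".join(parts)

examples = [([71, 12, 503], "hello world"), ([9, 233, 87], "good morning")]
print(make_asr_prompt(examples, query_units=[14, 88, 201]))
```

Swapping the order of the spans (text first, speech second) turns the same recipe into a few-shot TTS prompt.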

Expressivity in Speech

One of the most exciting aspects of SPIRIT LM is its focus on expressivity. The SPIRIT LM EXPRESSIVE version extends the model’s speech capabilities by incorporating pitch and style tokens in addition to phonetic units. This allows the model to capture not only what is being said but how it is being said, preserving the emotional and prosodic elements of speech.

This expressivity is particularly useful in tasks like emotion conversion or dialogue generation, where maintaining the tone and sentiment of speech is crucial. For instance, the model can generate a sad or happy version of the same sentence, making it adaptable to a wide range of conversational contexts.
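As a rough illustration, the EXPRESSIVE token stream can be thought of as the phonetic stream with coarser pitch and style tokens interleaved at fixed strides. The vocabularies, rates, and token names below are assumptions made for the sketch, not the exact quantization used by Meta AI.

```python
phonetic_units = [71, 12, 12, 503, 9, 233, 87, 87]   # dense HuBERT units
pitch_tokens   = ["p3", "p5"]                         # coarser pitch track
style_tokens   = ["sty_happy"]                        # one style token per chunk

def merge_expressive(units, pitch, style, pitch_every=4, style_every=8):
    """Interleave pitch/style tokens into the phonetic stream at fixed strides."""
    stream = []
    for i, unit in enumerate(units):
        if i % style_every == 0 and i // style_every < len(style):
            stream.append(style[i // style_every])
        if i % pitch_every == 0 and i // pitch_every < len(pitch):
            stream.append(pitch[i // pitch_every])
        stream.append(f"hu:{unit}")
    return stream

print(merge_expressive(phonetic_units, pitch_tokens, style_tokens))
# ['sty_happy', 'p3', 'hu:71', ..., 'p5', 'hu:9', ...]
```

Because the pitch and style tokens describe how the phonetic units should be realized, the same content can be decoded as, say, a happy or a sad rendition of the sentence.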

Multimodal Sentiment Preservation

In addition to its expressiveness, SPIRIT LM excels in maintaining sentiment across modalities. The team at Meta AI introduced a new benchmark called SPEECH-TEXT SENTIMENT PRESERVATION (STSP) to evaluate how well the model preserves sentiment when switching between speech and text. Results show that SPIRIT LM is the first model capable of consistently maintaining sentiment both within and across modalities, making it a leader in multimodal sentiment analysis.
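A simplified version of such an evaluation can be expressed as: generate a continuation in the target modality, classify its sentiment, and count how often the prompt's sentiment label is preserved. The sketch below is only schematic; `generate_continuation` and `classify_sentiment` are placeholder callables, not real SPIRIT LM or STSP APIs.

```python
def sentiment_preservation_rate(dataset, generate_continuation, classify_sentiment):
    """dataset: iterable of (prompt_tokens, target_modality, gold_sentiment)."""
    kept, total = 0, 0
    for prompt_tokens, target_modality, gold_sentiment in dataset:
        continuation = generate_continuation(prompt_tokens, modality=target_modality)
        predicted = classify_sentiment(continuation, modality=target_modality)
        kept += int(predicted == gold_sentiment)
        total += 1
    return kept / total if total else 0.0
```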

Check out the demo page!

Technical Architecture of SPIRIT LM

The backbone of SPIRIT LM is the LLaMA 2 architecture, fine-tuned with both text and speech data.

Figure: Technical architecture of SPIRIT LM (source: Spirit LM: Interleaved Spoken and Written Language)

Here’s a breakdown of the key components of the model; a rough end-to-end sketch follows the list:

  • Speech Encoder: The model uses HuBERT to convert raw audio into phonetic units. These units are clustered to represent the basic phonetic components of speech.
  • Text Encoder: For text, SPIRIT LM relies on Byte Pair Encoding (BPE) to tokenize written input into subword units, similar to how it is done in traditional LLMs.
  • Interleaving Mechanism: A special token system is used to mark transitions between speech and text, with speech tokens prefixed by [SPEECH] and text tokens by [TEXT]. The model randomly switches modalities at word boundaries, helping it learn to generate and understand both speech and text within the same context.
  • Speech Decoder: The speech tokens generated by SPIRIT LM are converted back into audio using a HiFi-GAN vocoder, trained specifically to handle expressive speech synthesis.
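Putting the pieces together, a speech-in, speech-out pass through the system might be wired up as in the sketch below. The class and method names are placeholders standing in for the real components (the HuBERT tokenizer, the LLaMA 2-based language model, and the HiFi-GAN vocoder); this is not the repository's actual API.

```python
class SpeechToSpeechPipeline:
    """Schematic wiring of the components described above (placeholder APIs)."""

    def __init__(self, speech_tokenizer, text_tokenizer, language_model, vocoder):
        self.speech_tokenizer = speech_tokenizer  # raw audio -> HuBERT unit IDs
        self.text_tokenizer = text_tokenizer      # text -> BPE subword IDs
        self.language_model = language_model      # unified token LM (LLaMA 2 backbone)
        self.vocoder = vocoder                    # speech tokens -> waveform (HiFi-GAN)

    def continue_speech(self, audio, max_new_tokens=256):
        units = self.speech_tokenizer.encode(audio)
        prompt = ["[SPEECH]"] + [f"hu:{u}" for u in units]
        generated = self.language_model.generate(prompt, max_new_tokens=max_new_tokens)
        # Keep only speech tokens for the vocoder; the model may also emit [TEXT] spans.
        speech_tokens = [t for t in generated if str(t).startswith("hu:")]
        return self.vocoder.decode(speech_tokens)
```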

Read the paper on arXiv, authored by Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoit Sagot, and Emmanuel Dupoux: Spirit LM: Interleaved Spoken and Written Language.

Evaluation of SPIRIT LM

SPIRIT LM has been evaluated extensively across both text and speech modalities, with strong results in comprehension, sentiment preservation, and few-shot learning tasks.

Zero-Shot and Few-Shot Comprehension

SPIRIT LM was tested on tasks like WUGGY, BLIMP, and StoryCloze, which measure grammatical and semantic understanding. The model performed competitively, especially in speech comprehension, often outperforming speech-only baselines.

Figure: Zero-shot and few-shot comprehension results (source: Spirit LM: Interleaved Spoken and Written Language)

Sentiment Preservation

Using the Speech-Text Sentiment Preservation (STSP) benchmark, SPIRIT LM, particularly the EXPRESSIVE version, showed impressive results in maintaining sentiment across modalities. It accurately preserved emotional tones when switching between speech and text.

Few-Shot Learning

SPIRIT LM demonstrated few-shot learning capabilities, quickly adapting to new tasks like ASR and TTS with just a few examples, offering flexibility for low-resource tasks.

Figure: Zero-shot and few-shot results (source: Spirit LM: Interleaved Spoken and Written Language)

Cross-Modal Alignment

The model exhibited strong alignment between speech and text tokens, with the middle and later transformer layers showing the most effective cross-modal understanding.

Applications and Use Cases of SPIRIT LM

SPIRIT LM opens the door to a variety of applications across industries:

  • Assistive Technologies: By combining speech recognition and generation, SPIRIT LM can be used in voice assistants or accessibility tools for individuals with disabilities, enabling more natural and expressive interactions.
  • Content Creation: The model’s ability to switch between text and speech seamlessly can enhance podcasts, audiobooks, or even voice-acted video games, where expressive and dynamic speech generation is key.
  • Multimodal Translation: SPIRIT LM can be applied in speech-to-text or text-to-speech translation systems, improving real-time communication across different languages and mediums.
  • Sentiment Analysis: Businesses can leverage the model’s cross-modal sentiment preservation capabilities for customer service, ensuring that AI-driven interactions maintain the desired tone and mood, whether through chat or voice.

Limitations of SPIRIT LM

Despite its strengths, SPIRIT LM has some limitations:

Performance Degradation in Larger Models

Although SPIRIT LM performs well, there is some performance degradation when scaling beyond certain model sizes. Larger models, particularly the EXPRESSIVE version, showed a drop in accuracy on traditional text comprehension tasks like WUGGY and BLIMP. This is likely due to the added complexity introduced by pitch and style tokens, which increase the token sequence length and thus the difficulty of modeling expressive speech.

While SPIRIT LM is built on LLaMA 2, it does not perform as well as LLaMA 2 in pure text-based tasks, indicating that fine-tuning for both speech and text could affect the model’s original strengths in text generation.

Speech Generation Complexity

In speech-based tasks, particularly those requiring high fidelity in expressive speech generation, SPIRIT LM occasionally struggles with maintaining the quality of speech synthesis, especially in long-form text-to-speech generation. This is mainly due to the added complexity of handling phonetics, pitch, and style tokens simultaneously. Compared to models designed specifically for TTS tasks, SPIRIT LM’s results are slightly less natural and coherent.

Limited Non-English Support

SPIRIT LM is primarily trained in English, limiting its effectiveness in multilingual applications. Expanding to other languages would require substantial re-training.

Added Toxicity Risks

As with many large language models, SPIRIT LM carries the risk of generating harmful or toxic content, particularly when handling user inputs across modalities. The model was tested on HOLISTICBIAS, a dataset designed to trigger biased or toxic outputs. Results showed that SPIRIT LM occasionally generated toxic content, particularly in cross-modal scenarios where speech was converted to text or vice versa. This could be due to biases in the underlying training data, and further safety measures like instruction-tuning or red-teaming will be necessary for any deployment in real-world applications.

Trade-offs in Expressiveness

Adding expressive features increases model complexity, which can slow down performance and reduce the quality of outputs for non-expressive tasks.

SPIRIT LM: Ethical Considerations and Safety

While SPIRIT LM offers impressive capabilities, it also comes with certain risks, particularly regarding the generation of harmful or toxic content. Like other large language models, SPIRIT LM has the potential to produce harmful outputs if not carefully managed. To mitigate these risks, Meta AI emphasizes the need for red-teaming and instruction-tuning to ensure the model meets safety standards, particularly in user-facing applications.

Accessing SPIRIT LM

  • Inference Code on GitHub: You can find the inference code for SPIRIT LM on GitHub. This repository provides the necessary tools for developers to integrate and experiment with the model; a hypothetical usage sketch follows this list.
  • Model Weights Request: To access the model weights for SPIRIT LM, submit a request through the designated channel. This ensures responsible usage and compliance with Meta AI’s guidelines.
  • Model Card: For comprehensive information about SPIRIT LM, including its capabilities, intended uses, and limitations, refer to the model card. This resource is essential for understanding how to effectively leverage SPIRIT LM in various applications.
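For orientation, the snippet below follows the usage pattern shown in the spiritlm GitHub repository's README at the time of writing; treat the import path, class names, and checkpoint name as assumptions and verify them against the current README and model card before use.

```python
# Assumed API, mirroring the public spiritlm README; verify before relying on it.
from transformers import GenerationConfig
from spiritlm.model.spiritlm_model import (
    Spiritlm, OutputModality, GenerationInput, ContentType,
)

model = Spiritlm("spirit-lm-base-7b")  # requires approved model weights on disk

outputs = model.generate(
    output_modality=OutputModality.TEXT,
    interleaved_inputs=[
        GenerationInput(
            content="The largest country in the world is",
            content_type=ContentType.TEXT,
        )
    ],
    generation_config=GenerationConfig(
        temperature=0.9, top_p=0.95, max_new_tokens=50, do_sample=True
    ),
)
print(outputs)
```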

SPIRIT LM: Key Takeaways

  • Integration of Speech and Text: SPIRIT LM merges speech and text processing into a unified system, enabling seamless interaction and a range of applications from ASR to expressive speech generation.
  • Expressive and Sentiment-Preserving Capabilities: The EXPRESSIVE variant captures speech nuances like pitch and style while maintaining sentiment across modalities, making it valuable for emotionally rich applications.
  • Few-Shot Learning Flexibility: SPIRIT LM demonstrates impressive few-shot learning, quickly adapting to new tasks with minimal examples, making it suitable for low-resource environments.
  • Addressing Limitations and Ethical Considerations: While SPIRIT LM showcases innovative features, it faces challenges like performance degradation in larger models and the risk of generating biased content, necessitating careful implementation of safety measures.

If you're extracting images and text from PDFs to build a dataset for your multimodal AI model, be sure to explore Encord's Document Annotation Tool—to train and fine-tune high-performing NLP Models and LLMs.

Written by

Ulrik Stig Hansen
