Software To Help You Turn Your Data Into AI
Forget fragmented workflows, annotation tools, and notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.
In 1956, John McCarthy, a young assistant professor of mathematics, convened 10 mathematicians and scientists for a two-month summer study of “thinking machines”. The workshop rested on the assumption that if mathematicians and scientists could describe every aspect of learning in a way that enabled a machine to simulate it, then they could begin to understand how to make machines use language, form abstractions, and solve problems. Today, many computer scientists consider the resulting Dartmouth Summer Research Project on Artificial Intelligence the event that launched artificial intelligence (AI) as a field of study.
AI is now a wide-ranging branch of computer science that includes machine learning and deep learning, but, overall, it still focuses on building “thinking” machines: machines capable of demonstrating intelligence by performing tasks and solving problems that previously required human intelligence.
Because these thinking machines can’t yet think on their own, a fundamental aspect of AI is teaching machines to think. Much like a baby learning to make sense of the world around her, a computer must be taught to make sense of the data it’s given. A subdivision of AI called machine learning aims to teach computers how to make inferences from patterns within datasets, and ultimately, develop computer systems that can learn and adapt without explicit programming.
Machine learning uses algorithms, data, computing power, and models to train machines to learn from their experiences. Machine learning models make it possible for computers to continuously improve and learn from their mistakes.
Machine learning engineers build these models from mathematical algorithms. An algorithm is a sequence of instructions that tells the computer how to transform data into useful information. Data scientists typically design algorithms to solve specific problems and then run them on data so that they can “learn” to recognise patterns.
In a general sense, a machine learning model is a representation of what an algorithm is “learning” from processing the data. After running an algorithm on data, machine learning engineers save the rules, numbers, and other algorithm-specific data structures needed to make predictions – all of which combined make up the model. The model is like a program made up of the data and the instructions for how to use that data to make a prediction (a predictive algorithm).
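To make the algorithm-versus-model distinction concrete, here is a minimal sketch in Python using scikit-learn; the tiny dataset is purely illustrative.

```python
# A minimal sketch of the algorithm-vs-model distinction using scikit-learn.
# The data here is illustrative, not from any real dataset.
from sklearn.linear_model import LogisticRegression

# Training data: each row is an example, each column a feature.
X_train = [[0.1, 1.2], [0.8, 0.4], [0.3, 0.9], [0.9, 0.2]]
y_train = [0, 1, 0, 1]  # labels the algorithm should learn to predict

# LogisticRegression is the *algorithm*: a recipe for turning data into rules.
algorithm = LogisticRegression()

# Calling fit() runs the algorithm on the data. The fitted object is the
# *model*: the learned weights plus the instructions for using them.
model = algorithm.fit(X_train, y_train)

# The model can now make predictions on data it has never seen.
print(model.predict([[0.2, 1.0]]))  # e.g. [0]
```

The fitted object bundles the learned numbers with the logic for applying them, which is exactly the “model” described above.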
After using algorithms to design predictive models, data scientists must train the model by feeding it data and using human expertise to assess how well it makes predictions. The model combs through mountains of data, and – with the help of human feedback along the way – it learns to weigh diverse inputs. It uses these inputs to learn to identify patterns, categorise information, create predictions, make decisions, and more.
Machine learning is an iterative process, which means the model learns based on its past experiences, just like a human would. The machine learning model “remembers” what it learned from working with a previous dataset – where it performed well and where it didn’t – and it uses this feedback to improve its performance on future datasets. If needed, data scientists can tweak the algorithm that built the model to reduce errors in its outputs.
Unlike a computer system that acts based on a predefined set of rules, after being trained on data, a machine learning model can perform tasks without being explicitly programmed to do so. However, the quality of the data that data scientists use to train the machine directly impacts how well the machine learns (more on that below).
Machine learning has many applications: it’s used in speech recognition, traffic prediction, virtual assistants, email filtering, and more.
At Encord, we help organisations using a type of machine learning called computer vision to create high-quality training data. Our platform automates data annotation, evaluation, and management. Because training data directly impacts performance, high-quality training data is critical to a computer vision model’s success.
Computer vision is, to some extent, what its name implies: a field of AI that aims to help computers “see” the world around them. Computer vision models attempt to mimic the function of the human visual system by teaching the computer how to take in visual information, analyse it, and reach conclusions based on this analysis.
Data scientists have created, and continue to create, different machine learning models for different uses. For computer vision, a commonly used model is the artificial neural network.
Artificial neural networks (ANN) are computing systems inspired by the patterns of human brain cells and the ways in which biological neurons signal to one another. ANNs are made up of interconnected nodes arranged into a series of layers. An individual node often connects to several nodes in the layer below it, from which it receives data, and several in the layer above it, to which it sends data.
The input layer receives the initial dataset fed to the model and passes it to the hidden, internal layers. As data moves through the hidden layers, each node performs a series of computations – multiplying and adding values in complex ways – to transform its inputs into a useful output and to determine whether to pass the result on to the next layer. When the data reaches the output layer, the model uses what it has learned to make a prediction.
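Here is a toy illustration of those node computations in NumPy. The layer sizes and weights are made up; a real network would learn its weights from data rather than drawing them at random.

```python
# A toy forward pass through a two-layer network: each node computes a
# weighted sum of its inputs plus a bias, then applies an activation
# before passing the result to the next layer.
import numpy as np

def relu(x):
    return np.maximum(0, x)  # activation: decides how strongly a node "fires"

x = np.array([0.5, -1.2, 3.0])        # input layer: the initial data

W1 = np.random.randn(4, 3) * 0.1      # hidden layer: 4 nodes, 3 inputs each
b1 = np.zeros(4)
hidden = relu(W1 @ x + b1)            # multiply, add, activate

W2 = np.random.randn(2, 4) * 0.1      # output layer: 2 nodes
b2 = np.zeros(2)
logits = W2 @ hidden + b2

# softmax turns the output layer's numbers into a prediction (probabilities)
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)
```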
Neural networks allow computers to process, analyse, and understand images and videos, enabling them to extract meaningful information from visual input much as humans do. Through such models, a computer can interpret a visual environment and make decisions based on that input. However, unlike the human visual system, which develops naturally over years, a computer has to be taught to “see” and make sense of a visual scene. Humans must train computer vision models to “see” by feeding them lots of high-quality data.
Applications of computer vision vary depending on the type of problem the model is trying to solve, but some of the most common tasks are image processing and classification, object detection, and image segmentation.
Example weather classification
Image classification is when a computer “sees” an image and can “categorise” it. Is there a house in this picture? Is this a picture of a dog or a cat? With a suitably trained image classification model, a computer can answer these questions.
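As a hedged example of what this looks like in practice, here is a sketch using a pretrained classifier from torchvision (version 0.13 or later); "cat.jpg" is a placeholder path.

```python
# A sketch of image classification with a pretrained torchvision model.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

preprocess = weights.transforms()  # the resizing/normalisation the model expects
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    logits = model(image)

class_id = logits.argmax(dim=1).item()
print(weights.meta["categories"][class_id])  # e.g. "tabby" or "Egyptian cat"
```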
When performing object detection, computer vision models learn to classify an object and detect its location. An object detection model could, for instance, identify that a car is in a video and track its movement from frame to frame.
Lastly, an image segmentation model distinguishes between an object and its background and other objects by creating a set of pixels for each object in the image. Compared to object detection, image segmentation provides a more granular understanding of the objects in an image.
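To contrast the tasks, here is a similar hedged sketch with torchvision's Mask R-CNN, which performs object detection and instance segmentation together; "street.jpg" is again a placeholder path.

```python
# A sketch contrasting detection and segmentation outputs with torchvision's
# Mask R-CNN, which predicts both boxes and per-pixel masks.
import torch
from torchvision import models, transforms
from PIL import Image

weights = models.detection.MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.maskrcnn_resnet50_fpn(weights=weights)
model.eval()

image = transforms.ToTensor()(Image.open("street.jpg"))

with torch.no_grad():
    output = model([image])[0]  # the model takes a list of images

print(output["boxes"].shape)   # object detection: one bounding box per object
print(output["labels"])        # the class of each detected object
print(output["masks"].shape)   # segmentation: one per-pixel mask per object
```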
Computer vision plays an important role in many industries. Consider medical imaging, where doctors use AI to help identify tumours: the computer has to learn to “see” the tumours as distinct from other body tissue. Similarly, the computers running self-driving cars must be taught to “see” and avoid pedestrians, and to turn visual information into meaningful insight, such as identifying street signs and interpreting what they mean.
So who teaches a computer vision model to distinguish between a stop sign and a yield sign? Humans do, by creating a well-designed model and feeding it high-quality training data.
Interested in learning more? Schedule a demo to better understand how Encord can help your company unlock the power of AI.
Join the Encord Developers community to discuss the latest in computer vision, machine learning, and data-centric AI.
Related Blogs
Multimodal deep learning is a recent trend in artificial intelligence (AI) that is revolutionizing how machines understand the real world using multiple data modalities, such as images, text, video, and audio. In particular, following OpenAI’s introduction of the Contrastive Language-Image Pre-training (CLIP) model, multiple machine learning frameworks have emerged that exploit visual representations to infer textual descriptions. These improved models adapt the CLIP framework to domain-specific use cases using more specialized datasets, and they achieve better state-of-the-art (SoTA) generalization performance than their predecessors.

This article discusses the benefits, challenges, and alternatives of OpenAI CLIP to help you choose a model for your specific domain. The architectures covered are:

- PubmedCLIP
- PLIP
- SigLIP
- StreetCLIP
- FashionCLIP
- CLIP-RSICD
- BioCLIP
- ClipBERT

OpenAI CLIP Model
CLIP is an open-source vision-language AI model by OpenAI trained on image and natural language data to perform zero-shot classification tasks. Users can provide textual captions, and the model assigns the most relevant label to the query image.

OpenAI CLIP Model: Architecture and Development
The training data consists of images from the internet paired with text snippets. The training task involves predicting which of 32,768 sampled text snippets is actually paired with a given image, using natural language processing (NLP) to understand visual concepts and relate them to the textual data.

CLIP Architecture

The model primarily uses an image encoder and a text encoder that convert images and labels into embeddings. Optimization involves minimizing a contrastive loss function by computing similarity scores between these embeddings and associating the correct label with each image.

See Also: What is Vector Similarity Search?

Once trained, the model can take an unseen image as input together with multiple candidate captions; the image and text encoders embed both, and CLIP predicts the caption that best matches the image.

Benefits of OpenAI CLIP
OpenAI CLIP has multiple benefits over traditional vision models. The most prominent advantages are:

- Zero-shot Learning (ZSL): CLIP’s training approach allows it to label unseen images without requiring expensive training on new datasets. Like Generative Pre-trained Transformer 3 (GPT-3) and GPT-4, CLIP can perform zero-shot classification tasks using natural language data with minimal training overhead. This property also helps users fine-tune CLIP more quickly to adapt it to new tasks.
- Better Real-World Performance: CLIP demonstrates better real-world performance than traditional vision models, which only work well on benchmark datasets.

Limitations of OpenAI CLIP
Although CLIP is a robust framework, it has a few limitations:

- Poor Performance on Fine-grained Tasks: CLIP’s classification performance drops on fine-grained tasks such as distinguishing between car models, animal species, or flower types.
- Out-of-Distribution Data: While CLIP performs well on data with distributions similar to its training set, performance drops when it encounters out-of-distribution data. The model requires more diverse image pre-training to generalize to entirely novel tasks.
- Inherent Social Bias: CLIP’s training data consists of randomly curated image-label pairs from the internet that do not undergo filtration, so the model learns the intrinsic biases present in image captions.
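Before turning to the alternatives, here is a minimal zero-shot classification sketch using the Hugging Face implementation of CLIP; the image path and candidate labels are placeholders.

```python
# Zero-shot classification with OpenAI CLIP via Hugging Face transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
inputs = processor(text=labels, images=Image.open("photo.jpg"),
                   return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # similarity scores -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```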
Due to these limitations, the following sections discuss a few alternatives for domain-specific tasks.

Learn how to build visual search engines with CLIP and ChatGPT in our on-demand webinar.

Alternatives to CLIP
Since CLIP’s introduction, multiple vision-language algorithms have emerged with unique capabilities for solving problems in healthcare, fashion, retail, and other domains. We will discuss a few alternative models that use the CLIP framework as their base, briefly covering their architectures, development approaches, performance results, and use cases.

1. PubmedCLIP
PubmedCLIP is a fine-tuned version of CLIP for medical visual question answering (MedVQA), which involves answering natural language questions about an image containing medical information.

PubmedCLIP: Architecture and Development
The model is pre-trained on the Radiology Objects in Context (ROCO) dataset, which consists of 80,000 samples spanning multiple imaging modalities, such as X-ray, fluoroscopy, and mammography. The image-text pairs come from PubMed articles; each text snippet briefly describes the image’s content.

PubmedCLIP Architecture

Pre-training involves fine-tuning CLIP’s image and text encoders to minimize a contrastive language-vision loss. The pre-trained module, PubmedCLIP, and a Convolutional Denoising Autoencoder (CDAE) encode images. A question encoder converts natural language questions into embeddings and combines them with the encoded image through a bilinear attention network (BAN). The training objective is to map the embeddings to the correct answer by minimizing answer classification and image reconstruction losses using a CDAE decoder.

Performance Results of PubmedCLIP
PubmedCLIP improves accuracy by 1% over CLIP on the VQA-RAD dataset, and by 3% on the SLAKE dataset when using the vision transformer ViT-B/32 as the backbone.

See Also: Introduction to Vision Transformers (ViT).

PubmedCLIP: Use Case
Healthcare professionals can use PubmedCLIP to interpret complex medical images for better diagnosis and patient care.

2. PLIP
The Pathology Language-Image Pre-training (PLIP) model is a CLIP-based framework trained on extensive, high-quality pathology data curated from open social media platforms such as medical Twitter.

PLIP: Architecture and Development
Researchers used 32 pathology hashtags recommended by the United States and Canadian Academy of Pathology (USCAP) and the Pathology Hashtag Ontology project. The hashtags helped them retrieve relevant tweets containing de-identified pathology images and natural-language descriptions. The final dataset, OpenPath, comprises 116,504 image-text pairs from Twitter posts, 59,869 image-text pairs from the corresponding replies with the most likes, and 32,041 additional image-text pairs from the internet and the LAION dataset.

OpenPath Dataset

The researchers used OpenPath to fine-tune CLIP through an image preprocessing pipeline involving down-sampling, augmentation, and random cropping.

Performance Results of PLIP
PLIP achieved state-of-the-art (SoTA) performance across four benchmark datasets. On average, PLIP achieved an F1 score of 0.891, while CLIP scored 0.813.

PLIP: Use Case
PLIP classifies pathology images for multiple medical diagnostic tasks and helps retrieve unique pathology cases through image or natural-language search.
New to medical imaging? Check out ‘Guide to Experiments for Medical Imaging in Machine Learning.’

3. SigLIP
SigLIP uses a simpler sigmoid loss function to optimize the training process, instead of the softmax contrastive loss traditionally used in CLIP. The method boosts training efficiency and lets users scale training to more extensive datasets.

SigLIP: Architecture and Development
Optimizing the contrastive loss function means maximizing the distance between non-matching image-text pairs while minimizing the distance between matching pairs. However, the method requires text-to-image and image-to-text permutations across all images and text captions, and it involves computing normalization factors to calculate the softmax loss. This approach is computationally expensive and memory-inefficient.

The sigmoid loss simplifies the technique by turning the loss into a binary classification problem: matching pairs receive a positive label and non-matching combinations receive a negative label.

Efficient Loss Implementation

In addition, the permutations are computed across multiple devices, with each device predicting positive and negative labels for its image-text pairs. The devices then swap text snippets to re-compute the loss against the corresponding images.

Performance Results of SigLIP
Based on the accuracy metric, the sigmoid loss outperforms the softmax loss at smaller batch sizes on the ImageNet dataset.

Performance Comparison

Both losses deteriorate beyond a certain batch size, with the softmax loss performing slightly better at very large batch sizes.

SigLIP: Use Case
SigLIP is suitable for training on extensive datasets. Users can fine-tune SigLIP with smaller batch sizes for faster training.
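To make the difference between the two losses concrete, here is a hedged PyTorch sketch over a toy batch of paired embeddings. It omits details from the actual papers, such as SigLIP's learnable temperature and bias terms.

```python
# A sketch of softmax vs. sigmoid contrastive losses over a batch of
# matched image/text embeddings (not the papers' exact implementations).
import torch
import torch.nn.functional as F

def softmax_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style: every image is classified against *all* texts in the batch,
    # which requires the full similarity matrix and its normalization.
    logits = img_emb @ txt_emb.T / temperature
    targets = torch.arange(len(img_emb))          # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def sigmoid_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # SigLIP-style: every (image, text) pair becomes an independent binary
    # classification -- label +1 on the diagonal (match), -1 elsewhere.
    logits = img_emb @ txt_emb.T / temperature
    labels = 2 * torch.eye(len(img_emb)) - 1
    return -F.logsigmoid(labels * logits).mean()

img = F.normalize(torch.randn(8, 512), dim=-1)  # toy batch of 8 pairs
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(softmax_contrastive_loss(img, txt), sigmoid_contrastive_loss(img, txt))
```

Because each pair's loss term is independent, the sigmoid variant avoids the batch-wide normalization that makes the softmax loss expensive at scale.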
4. StreetCLIP
StreetCLIP is an image geolocalization model that fine-tunes CLIP on geolocation data to predict the locations of particular images. The model is available on Hugging Face for further research.

StreetCLIP: Architecture and Development
The model improves CLIP’s zero-shot learning capabilities by training a generalized zero-shot learning (GZSL) classifier that classifies seen and unseen images simultaneously during training.

StreetCLIP Architecture

Fine-tuning involves generating synthetic captions for each image specifying the city, country, and region. The training objective is to correctly predict these three labels for seen and unseen photos by optimizing a GZSL loss and a vision representation loss.

Performance Results of StreetCLIP
StreetCLIP has better geolocation prediction accuracy than CLIP, outperforming it by 0.3 to 2.4 percentage points on the IM2GPS and IM2GPS3K benchmarks.

StreetCLIP: Use Case
StreetCLIP suits navigational applications where users need information on weather, seasons, climate patterns, and similar factors. It can also help intelligence agencies and journalists extract geographical information from crime-scene imagery.

5. FashionCLIP
FashionCLIP (F-CLIP) fine-tunes the CLIP model on fashion datasets consisting of apparel images and textual descriptions. The model is available on GitHub and Hugging Face.

FashionCLIP: Architecture and Development
The researchers trained the model on 700k image-text pairs from the Farfetch inventory dataset and evaluated it on image retrieval and classification tasks.

F-CLIP Architecture

The evaluation also tested grounding capability. For instance, zero-shot segmentation assessed whether the model understood fashion concepts such as sleeve length, brands, textures, and colors. The researchers also evaluated compositional understanding by creating improbable objects to see whether F-CLIP generated appropriate captions, for example, whether F-CLIP can generate the caption “a Nike dress” when shown a picture of a long dress with the Nike logo.

Performance Results of FashionCLIP
F-CLIP outperforms CLIP on multiple benchmark datasets for multimodal retrieval and product classification tasks. For instance, F-CLIP’s F1 score for product classification on the F-MNIST dataset is 0.71, compared to 0.66 for CLIP.

FashionCLIP: Use Case
Retailers can use F-CLIP to build chatbots for their e-commerce sites that help customers find relevant products from specific text prompts. The model can also power image-generation applications for visualizing new product designs based on textual descriptions.

6. CLIP-RSICD
CLIP-RSICD is a fine-tuned version of CLIP trained on the Remote Sensing Image Captioning Dataset (RSICD). It is built with Flax, a neural network library for JAX (a Python package for high-performance computing), and can run on a CPU. The model is available on GitHub.

CLIP-RSICD: Architecture and Development
The RSICD consists of 10,000 images from Google Earth, Baidu Map, MapABC, and Tianditu. Each image comes in multiple resolutions with five captions.

RSICD Dataset

Because the dataset is small, the developers applied augmentation techniques using the transforms in PyTorch’s Torchvision package: random cropping, random resized cropping, color jitter, and random horizontal and vertical flipping.

Performance Results of CLIP-RSICD
On the RSICD test set, the regular CLIP model achieved an accuracy of 0.572, while CLIP-RSICD scored 0.883.

CLIP-RSICD: Use Case
CLIP-RSICD is best suited to extracting information from satellite images and drone footage. It can also help identify red flags in specific regions to predict natural disasters driven by climate change.
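As an illustration, an augmentation pipeline along the lines described might look like the following; the exact parameters used for CLIP-RSICD are assumptions here.

```python
# A sketch of the Torchvision augmentations described above; the parameter
# values are illustrative, not CLIP-RSICD's actual configuration.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random resize + crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),  # reasonable for overhead imagery
    transforms.ToTensor(),
])
```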
7. BioCLIP
BioCLIP is a foundation model for the tree of life, trained on an extensive biology image dataset to classify biological organisms according to their taxonomy.

BioCLIP: Architecture and Development
BioCLIP fine-tunes the CLIP framework on a custom-curated dataset, TreeOfLife-10M, comprising 10 million images covering 454 thousand taxa in the tree of life. Each image’s taxonomic label describes its kingdom, phylum, class, order, family, genus, and species.

Taxonomic Labels

The CLIP model takes the taxonomy as a flattened string and matches the description to the correct image by optimizing the contrastive loss function. The researchers also enhanced training by providing both scientific and common names for a species to improve generalization, helping the model recognize a species through the name used in everyday language.

Performance Results of BioCLIP
On average, BioCLIP boosts accuracy on zero-shot classification tasks by 18% over CLIP across ten biological datasets.

BioCLIP: Use Case
BioCLIP is ideal for biological research involving VQA tasks where experts quickly need information about specific species.

Watch Also: How to Fine-Tune Foundation Models to Auto-Label Training Data.

8. ClipBERT
ClipBERT is a video-and-language model that uses a sparse sampling strategy to quickly classify video clips from diverse domains. It uses Bidirectional Encoder Representations from Transformers (BERT), a large language model (LLM), as its text encoder and ResNet-50 as its visual encoder.

ClipBERT: Architecture and Development
The model’s sparse sampling method uses only a few sampled clips from a video in each training step, extracting visual features through a convolutional neural network (CNN). This strategy improves training speed compared to methods that extract dense features from full video streams. The model initializes BERT with weights pre-trained on BookCorpus and English Wikipedia to obtain word embeddings for the textual descriptions of corresponding video clips.

ClipBERT

Training involves correctly predicting a video’s description by combining each clip’s predictions and comparing them with the ground truth. The researchers trained the model for 40 epochs over four days on 8 NVIDIA V100 GPUs. During inference, the model samples multiple clips and aggregates the per-clip predictions into a final video-level prediction.

Performance Results of ClipBERT
ClipBERT outperforms multiple SoTA models on video retrieval and question-answering tasks. For instance, ClipBERT shows a 4% improvement over HERO on video retrieval tasks.

ClipBERT: Use Case
ClipBERT can help users analyze complex videos and develop generative AI tools for video content creation.

See Also: FastViT: Hybrid Vision Transformer with Structural Reparameterization.

Alternatives to OpenAI CLIP: Key Takeaways
With frameworks like CLIP and ChatGPT, combining computer vision with NLP is becoming the norm for building advanced multimodal models that solve modern industrial problems. Below are a few critical points to remember regarding CLIP and its alternatives:

- Benefits: OpenAI CLIP is an excellent choice for general vision-language tasks requiring little domain-specific expertise.
- Limitations: While CLIP’s zero-shot capability helps users adapt the model to new tasks, it underperforms on fine-grained tasks and out-of-distribution data.
- Alternatives: Multiple CLIP-based options exist for medical image analysis, biological research, geolocalization, fashion, and video understanding.
Meta has released Llama 3 pre-trained and instruction-fine-tuned language models with 8 billion (8B) and 70 billion (70B) parameters. These models bring improved reasoning, coding, and math-solving capabilities, and they set a new state-of-the-art (SoTA) for openly available models of their sizes. This release builds upon the company’s commitment to accessible, SoTA models.

Llama 3 stands out for its focus on instruction-tuned capabilities, showing that Meta is serious about building helpful, safe AI systems that align with what users want. The Llama 3 family of models utilizes over 400 TFLOPS per GPU when trained on 16,000 GPUs simultaneously. The training runs were performed on two custom-built 24,000-GPU clusters.

In this article, you will learn:

- What we know so far about the underlying Llama 3 architecture (surprisingly, it’s not a Mixture of Experts; MoE).
- Key capabilities of the multi-parameter model.
- Key differentiators from Llama 2 and other models.
- The performance on benchmarks against other SoTA models.
- Potential applications and use cases.
- How you can test it out and plug it into your application now.

Here’s the TL;DR if you are pressed for time:

- Llama 3 models come in both pre-trained and instruction-following variants.
- Llama 3 promises increased responsiveness and accuracy in following complex instructions, which could lead to smoother user experiences with AI systems.
- The release includes 8B and 70B parameter models, with 400B+ parameter models still in training, allowing flexibility in resource management and potential scalability.
- It integrates with search engines like Google and Bing to draw on up-to-date, real-time information and augment its responses.
- It uses a new tokenizer with a vocabulary of 128K tokens, enabling it to encode language much more efficiently.
- It offers notably improved token efficiency: despite being larger, the 8B model maintains inference efficiency on par with Llama 2 7B.

Understanding the Model Architecture
In addition, training the model was three times more efficient than for Llama 2. In this section, you will learn about the architectural components of Llama 3 that make it this efficient.

Model Architecture with Improved Tokenizer Efficiency

Like many SoTA LLMs, Llama 3 uses a Transformer-based architecture, which allows efficient parallelization during training and inference and is well-suited to large-scale models. Here are the key insights:

- Efficiency Focus: Adopting a standard decoder-only Transformer architecture prioritizes computational efficiency during inference (i.e., generating text).
- Vocabulary Optimization: The 128K-token vocabulary offers significantly improved encoding efficiency compared to Llama 2, meaning the model can represent more diverse language patterns with fewer tokens, potentially boosting performance without increasing model size.
- Fine-Tuning the Attention Mechanism: Grouped query attention (GQA) aims to improve inference (text generation) speed for the 8B and 70B parameter models without sacrificing quality.
- Long Sequence Handling: Training on sequences of 8,192 tokens supports longer text inputs, which is essential for handling complex documents, conversations, or code where context extends beyond short passages.
- Document Boundary Awareness: A mask during self-attention prevents information leakage across document boundaries, which is vital for tasks like summarizing or reasoning over multiple documents, where maintaining clear distinctions is crucial.

Surprisingly, the architecture does not use a Mixture of Experts (MoE), which is popular in many recent LLMs.
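If you want to try the model while reading, here is a hedged sketch using Hugging Face Transformers. It assumes you have been granted access to the gated meta-llama/Meta-Llama-3-8B-Instruct checkpoint and have a GPU with enough memory.

```python
# A sketch of running the instruction-tuned 8B model with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated: request access first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarise grouped query attention in two sentences."},
]
# apply_chat_template formats the conversation with Llama 3's special tokens.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Llama 3 uses <|eot_id|> to end assistant turns, so include it as a stop token.
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
output = model.generate(input_ids, max_new_tokens=128, eos_token_id=terminators)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```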
Pretraining Data Composition
Llama 3 was trained on over 15 trillion tokens, a pretraining dataset more than seven times larger than Llama 2’s. Here are the key insights on the pretraining data:

- Massive Dataset Scale: The 15T+ token dataset is a massive increase over Llama 2, implying gains in model generalization and the ability to handle more nuanced language patterns.
- Code Emphasis: The dataset contains four times more code samples, which improves the model’s coding abilities.
- Multilingual Preparation: The dataset includes over 5% more non-English data than was used to train Llama 2, in preparation for future multilingual applications, though performance in non-English languages will likely differ initially.
- Quality Control Rigor: The team developed data-filtering pipelines to build high-quality training data, using heuristic filters, NSFW removal, deduplication, and classifiers to ensure model integrity and reduce potential biases.
- Data Mixing Experimentation: The emphasis on experimenting with varying data mixes highlights the importance of finding an optimal balance for diverse downstream use cases, suggesting Meta understands that the model will excel in different areas depending on its training composition.

Scaling Up Pre-training
Training LLMs remains computationally expensive, even with the most efficient implementations. Training Llama 3 demanded more than better scaling laws and infrastructure; it required efficient strategies for scaling up pre-training to achieve highly effective training time across 16,000 GPUs. Here are the key insights on scaling training:

- Scaling Laws as Guides: Meta leans heavily on scaling laws to determine optimal data mixes and resource allocation during training. These laws aren’t foolproof, but they enable more informed decisions about model development.
- Continued Improvement with Massive Data: The 8B and 70B models show significant log-linear improvement up to 15T tokens, suggesting that even large models benefit from more data and defying the notion of diminishing returns within the dataset sizes explored.
- Parallelization Techniques: Combining data, model, and pipeline parallelism allowed efficient training on up to 16K GPUs simultaneously.
- Reliability and Fault Tolerance: Automated error detection, a focus on hardware reliability, and scalable storage enhancements address the practical realities of training huge models; 95%+ effective training time is remarkable.

The team reported a 3x increase in training efficiency over Llama 2, likely due to a combination of the techniques above. The most important thing to remember is that bigger models can reach a given level of performance with less training computation, but smaller models remain better at generating responses quickly. This makes choosing the right model size for the job even more important.

Instruction Fine-Tuning
Meta’s blog mentions that Llama 3 is fine-tuned for instruction-following. This likely involved specific fine-tuning techniques on datasets designed to improve the model’s ability to understand and execute complex instructions.
Here are the key insights:

- Hybrid Fine-tuning Approach: Meta combines several techniques for instruction-tuning: supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). This multi-pronged strategy suggests flexibility and tailoring to specific use cases.
- Data as the Differentiator: The emphasis on the quality of prompts and preference rankings as prime drivers of aligned model performance highlights the importance of data curation alongside fine-tuning techniques.
- Human-in-the-Loop: Multiple rounds of quality assurance on human annotations remind us that human feedback remains vital for aligning and refining these complex models.
- Reasoning and Coding Benefits: PPO and DPO with preference-ranking data significantly boosted Llama 3’s performance on reasoning and coding tasks, underscoring the power of these techniques in specific domains.
- Answer Selection Fine-Tuning: Intriguingly, models can sometimes ‘understand’ the correct answer but struggle to select it. Preference-ranking training directly addresses this, teaching the model to discriminate between possible outputs.

Recommended: Training vs. Fine-tuning: What is the Difference?

Functional Capabilities of Llama 3
Meta’s advancements in pretraining and instruction-focused fine-tuning give Llama 3 potential across a wide range of natural language processing (NLP) and code-related tasks. Let’s explore some potential functional areas:

Conversational Interactions
- Asking for Advice: Llama 3’s instruction-following focus lets it provide guidance or suggestions for a problem scenario, drawing on knowledge from its training data to offer a variety of perspectives or solutions.
- Brainstorming: Llama 3’s creativity and language-generation capabilities could make it a helpful brainstorming partner, generating lists of ideas, suggesting alternative viewpoints, or creating out-of-the-box concept combinations to stimulate further thought.

Text Analysis and Manipulation
- Classification: With appropriate fine-tuning, Llama 3 can classify text, code, or other data into predefined categories, drawing on patterns from both its pretraining data and task-specific training.
- Closed Question Answering: Llama 3’s access to real-time search results and the large-scale knowledge base from its pretraining improve its potential for factual question answering; closed-ended questions yield accurate, concise responses.
- Extraction: Llama 3 can extract specific information from larger text documents or code bases; fine-tuning might target named entities, key phrases, or relevant relationships.

Code-Related
- Coding: Meta’s attention to code within the training data suggests Llama 3 possesses solid coding capability. It could generate code snippets, assist with debugging, or explain existing code.

Creative and Analytical
- Creative Writing: Llama 3’s generative abilities open possibilities for creative text formats, such as poems, stories, or scripts. Users might provide prompts, outlines, or stylistic guidelines to shape the output.
- Inhabiting a Character/Persona: Though not explicitly stated, Llama 3’s generative and knowledge-accessing capabilities indicate potential for adopting specific personas or character voices, which could be entertaining or useful for simulating specific conversational styles.
- Open Question Answering: Answering complex, open-ended questions thoroughly and accurately is more challenging, but Llama 3’s reasoning skills and access to external knowledge might offer insightful and nuanced responses.
- Reasoning: The emphasis on preference-ranking-based fine-tuning suggests advancements in reasoning; Llama 3 can analyze arguments, explain logical steps, or solve multi-part problems.
- Rewriting: Llama 3 could help rephrase text for clarity, alter the tone, or change writing styles. Users must carefully define their rewriting goals for the best results.
- Summarization: Llama 3’s ability to process long input sequences and its fine-tuned understanding of instructions position it well for text summarization. It might condense articles, reports, or meeting transcripts into key points.

Model Evaluation
Performance Benchmarking (Comparison: Gemma, Gemini, and Claude 3)
The team evaluated the models’ performance on standard benchmarks and also sought to optimize performance in real-life situations. To that end, they created a new, high-quality human evaluation set of 1,800 prompts covering 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, extraction, inhabiting a character or persona, open question answering, reasoning, rewriting, and summarizing.

Llama 3 70B broadly outperforms Gemini Pro 1.5 and Claude 3 Sonnet. It is slightly behind on MATH, where Gemini Pro 1.5 appears stronger, but it is small enough to host at scale without breaking the bank. Here are the performance benchmarks:

Meta Llama 3 Instruct model performance.
Meta Llama 3 Pre-trained model performance.

Let’s look at some of these benchmarks.

MMLU (Knowledge Benchmark)
The MMLU benchmark assesses a model’s ability to understand and answer questions requiring factual and common-sense knowledge. The 8B model scores 66.6, outperforming the published Mistral 7B (63.9) and measured Gemma 7B (64.4) results. The 70B model scores an impressive 79.5, outperforming the published Gemini Pro 1.0 (71.8) and measured Mixtral 8x22B (77.7) results. The high scores suggest Llama 3 can effectively access and process real-world information, complementing the knowledge gained from its massive training dataset.

AGIEval (English)
AGIEval measures performance on various English-language tasks, including question answering, summarization, and sentiment analysis. In a 3-shot setting, the 8B model scores 45.9, higher than both the published Gemma 7B result (44.0) and the measured one (44.9). The 70B model’s score of 63.0 outperforms the measured Mixtral 8x22B (61.2).

ARC (Skill Acquisition Benchmark)
The ARC benchmark assesses a model’s ability to reason and acquire new skills. In a 3-shot setting, the 8B model scores 78.6, slightly below both the published Gemma 7B result (78.7) and the measured one (79.1). The 70B model achieves a remarkable score of 93.0, significantly higher than the measured Mixtral 8x22B (90.7). The high scores suggest Llama 3 has been explicitly enhanced for these capabilities through preference-ranking techniques during fine-tuning.
DROP (Model Reasoning Benchmark)
This benchmark focuses on a model’s ability to perform logical reasoning over text, often involving numerical reasoning. In a 3-shot setting, the Llama 3 8B model scores 58.4 F1, higher than both the published Gemma 7B result (54.4) and the measured one (56.3). With a variable-shot score of 79.7, the Llama 3 70B model outperforms both the published Gemini Pro 1.0 (74.1) and the measured Mixtral 8x22B (77.6). While DROP can be challenging for LLMs, Llama 3’s performance suggests it handles some numerical reasoning tasks effectively.

Overall, the results show that Meta’s Llama 3 models, especially the larger 70B version, outperform other SoTA models on a variety of language understanding and reasoning tasks.

Responsible AI
Alongside Llama 3, the team released new Meta Llama trust and safety tools featuring Llama Guard 2, Code Shield, and CyberSecEval 2, plus an updated Responsible Use Guide and Getting Started Guide, new recipes, and more. Below are some of the approaches Meta used to test and secure Llama 3 against adversarial attacks.

A system-level approach to responsibility in Llama 3.

System-level Approach
- Responsible Development of LLMs: Meta emphasizes a holistic view of responsibility, going beyond the core model to encompass the entire system within which an LLM operates.
- Responsible Deployment of LLMs: Developers building applications with Llama 3 share responsibility for ethical use; Meta aims to provide tools and guidance to facilitate this.
- Instruction Fine-tuning: Fine-tuning with an emphasis on safety plays a crucial role in aligning the model with responsible-use guidelines and minimizing potential harms.

Red Teaming Approach
- Human Experts: Involving human experts in red teaming acknowledges that automated methods alone may not catch every nuance of potential misuse.
- Automation Methods: Automated methods are vital for scaling the testing process and generating a wide range of adversarial prompts to stress-test the model.
- Adversarial Prompt Generation: The focus on adversarial prompts highlights Meta’s proactive approach to identifying potential vulnerabilities and safety concerns before wider deployment.

Trust and Safety Tools
The development of specialized tools demonstrates a focus on mitigating specific risks:
- Llama Guard 2: Proactive prompt and output safety filtering that aligns with industry-standard taxonomies for easier adoption.
- Code Shield: Addresses security vulnerabilities unique to LLMs with code-generation capabilities.
- CyberSecEval 2: Focuses on assessing and mitigating cybersecurity-related risks associated with LLMs.

Llama 3 Trust and Safety Tools.

Responsible Use Guide (RUG)
- Responsible Development with LLMs: Updated guidance reinforces Meta’s commitment to providing developers with resources for ethical application building.
- Content Moderation APIs: Explicitly recommending external content-moderation tools suggests a multi-pronged approach to safety; developers are encouraged to use existing infrastructure to complement Meta’s own efforts.

You can find more of these updates on the Llama website.

Llama 3: Model Availability
Meta’s commitment to open-sourcing Llama 3 expands its accessibility and potential for broader impact. The model is expected to be available across various platforms, making it accessible to researchers, developers, and businesses of all sizes.
Cloud Providers
Major cloud providers are partnering with Meta to offer Llama 3 integration, making it widely accessible:

- AWS, Databricks, Google Cloud, and Microsoft Azure: These platforms provide scalable infrastructure, tools, and pre-configured environments that simplify model deployment and experimentation.
- NVIDIA NIM and Snowflake: NVIDIA and Snowflake also provide services for deploying and using Llama 3.

Model API Providers
- Hugging Face: A popular platform for model sharing and experimentation. Llama 3 is already available there, including GGUF versions and other variations.
- Ollama: The Ollama community has integrated the model’s different parameter sizes and variations into its library, which has over 15k downloads.

Llama 3: What’s Next?
Meta’s announcements reveal an exciting and ambitious future for the Llama 3 series of LLMs. The main areas of focus point to a model with far more capabilities and reach:

Scaling and Expansion
- Larger Models: Meta is currently developing larger Llama 3 models in the 400B+ parameter range, suggesting its ambition to push the boundaries of LLM capabilities further.
- Multimodality: Planned features include processing and generating not just text but other modalities, such as images and audio, which could greatly expand Llama 3’s use cases.
- Multilingualism: The goal of making Llama 3 conversant in multiple languages aligns with Meta’s global focus, opening up possibilities for cross-lingual interactions and applications.
- Longer Context Window: Increasing the amount of text the model can process at once would enable Llama 3 to handle more complex tasks, improving its understanding of extended conversations, intricate documents, and large codebases.
- Enhanced Capabilities: An overall emphasis on improved capabilities hints at advancements in reasoning, problem-solving, and coding that may exceed the impressive performance of the currently released models.

Research Transparency
- Research Paper: Meta plans to publish a detailed research paper once training of the larger Llama 3 models is complete, a commitment to transparency and knowledge-sharing that aligns with its open-source philosophy.

Focus on Accessibility and Real-World Impact
- Wider Platform Availability: Collaboration with cloud providers, hardware companies, and hosting platforms seeks to make the model readily accessible across various resources, encouraging wider experimentation and adoption.
- Open-Source Commitment: Meta encourages community involvement to accelerate development progress, underscoring its belief that open source drives innovation and safety.

Want to experience Llama 3 right now? Meta’s latest models are integrated into Meta AI, which is rolling out to more countries, is available across Meta’s family of apps, and has a new home on the web. See the model card, or experience it on meta.ai.

Llama 3: Key Takeaways
Llama 3 is already a game-changer for the open-source community. Let’s summarize the key takeaways, focusing on its significance and potential impact on the LLM landscape:

- Breakthrough in Performance: Meta’s claim that Llama 3 sets a new standard for 8B and 70B parameter models suggests a big improvement in LLM abilities at those size ranges.
- Focus on Accessibility: Llama 3’s open-sourcing, wide platform availability, and partnerships with major technology providers make it a powerful tool accessible to a much wider range of individuals and organizations than similar models.
- Real-World Emphasis: Meta’s use of custom human-evaluation sets and its focus on diverse use cases indicate active work to make Llama 3 perform well beyond theoretical benchmarks.
- Ambitious Trajectory: Ongoing training of larger models, exploration of multimodality, and multilingual development showcase Meta’s ambition to continuously push the boundaries of what LLMs can do.
- Emphasis on Instruction-Following: Llama 3’s refinement in accurately following complex instructions could make it particularly useful for building more user-friendly and adaptable AI systems.
What is MM1?
MM1 is a family of large multimodal language models that combine text and image understanding. The largest boasts an impressive 30 billion parameters, and the family excels in both pre-training and supervised fine-tuning. MM1 processes both image and text data, making it a powerful tool for various multimodal tasks, and it incorporates a mixture-of-experts (MoE) architecture, contributing to its state-of-the-art performance across benchmarks.

Introduction to Multimodal AI
Multimodal AI models are artificial intelligence models that can process and generate multiple types of data, such as text, images, and audio. They are designed to understand the world in a way that is closer to how humans do, by integrating information from different modalities.

Multimodal AI models typically combine different types of AI systems, each designed to process a specific type of data. For example, a multimodal AI model might use a convolutional neural network (CNN) to process visual data, a recurrent neural network (RNN) to process text data, and a transformer to integrate the information from the CNN and RNN. The outputs of these networks are then combined, often using techniques such as concatenation or attention mechanisms, to produce a final output that can be used for tasks such as classification, generation, or prediction. The sketch below illustrates the simplest of these fusion strategies.
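In this toy PyTorch example, each "encoder" is a stand-in linear layer for a real vision or text network, and fusion is plain concatenation; everything here is illustrative rather than any particular model's design.

```python
# A toy illustration of combining modalities by concatenating encoder outputs.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=256, n_classes=10):
        super().__init__()
        self.image_encoder = nn.Linear(img_dim, 128)  # placeholder for a CNN/ViT
        self.text_encoder = nn.Linear(txt_dim, 128)   # placeholder for a Transformer
        self.head = nn.Linear(128 * 2, n_classes)     # fusion by concatenation

    def forward(self, image_feats, text_feats):
        fused = torch.cat([self.image_encoder(image_feats),
                           self.text_encoder(text_feats)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 256))  # batch of 4 examples
print(logits.shape)  # torch.Size([4, 10])
```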
Overview of Multimodal Large Language Models (MLLMs)
Multimodal Large Language Models (MLLMs) are generative AI systems that combine different types of information, such as text, images, videos, audio, and sensory data, to understand and generate human-like language. These models are revolutionizing the field of natural language processing (NLP) by going beyond text-only models and incorporating a wide range of modalities. Here’s an overview of key aspects of MLLMs:

Architecture
MLLMs typically extend architectures like Transformers, which have proven highly effective at processing sequential data such as text. Transformers consist of attention mechanisms that enable the model to focus on relevant parts of the input data. In MLLMs, additional layers and mechanisms are added to process and incorporate information from other modalities.

Integration of Modalities
MLLMs are designed to handle inputs from multiple modalities simultaneously. For instance, they can analyze both the text and the accompanying image in a captioning task, or generate a response based on both text and audio inputs. This integration allows MLLMs to understand and generate content that is richer and more contextually grounded.

Pre-Training
Like their unimodal counterparts, MLLMs are often pre-trained on large datasets using self-supervised learning objectives. Pre-training exposes the model to vast amounts of multimodal data, allowing it to learn representations that capture the relationships between different modalities. Pre-training is typically followed by fine-tuning on specific downstream tasks.

State-of-the-Art Models
- CLIP (Contrastive Language-Image Pre-training): Developed by OpenAI, CLIP learns joint representations of images and text by contrasting semantically similar and dissimilar image-text pairs.
- GPT-4: Showcases remarkable capabilities in complex reasoning and advanced coding, and performs well on multiple academic exams.
- Kosmos-1: Created by Microsoft, this MLLM is trained from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
- PaLM-E: Developed by Google, PaLM-E integrates different modalities to enhance language understanding.

Understanding MM1 Models
MM1 represents a significant advancement in the domain of MLLMs, demonstrating state-of-the-art pre-training metrics and competitive results across various multimodal benchmarks. MM1’s development stems from a meticulous exploration of architecture components and data choices, aiming to distill essential design principles for building effective MLLMs.

MM1 Model Experiments: Key Research Findings

Architecture Components
- Image Encoder: The image encoder’s design, along with factors such as image resolution and token count, significantly impacts MM1’s performance. Careful ablations showed that optimizing the image encoder contributes substantially to MM1’s capabilities.
- Vision-Language Connector: While important, the design of the vision-language connector proved comparatively less significant than other architectural components. It nevertheless plays a crucial role in facilitating communication between the visual and textual modalities.

Data Choices
- Pre-training Data: MM1 leverages a diverse mix of image-caption, interleaved image-text, and text-only data for pre-training. This combination proved pivotal in achieving state-of-the-art few-shot results across multiple benchmarks. The study highlights the importance of different types of pre-training data for different tasks, with caption data being particularly impactful for zero-shot performance.
- Supervised Fine-Tuning (SFT): The effectiveness of the pre-training data choices was validated through SFT, where the capabilities and modeling decisions acquired during pre-training were retained, leading to competitive performance across evaluations and benchmarks.

Performance
- In-Context Learning Abilities: MM1 exhibits exceptional in-context learning abilities, particularly in its largest 30-billion-parameter configuration, which can perform multi-step reasoning over multiple images using few-shot “chain-of-thought” prompting.
- Model Scale: MM1’s scalability is demonstrated through the exploration of larger LLMs, ranging from 3B to 30B parameters, and the investigation of mixture-of-experts (MoE) models. This scalability contributes to MM1’s adaptability to diverse tasks and datasets.
- Performance: The MM1 models, including both dense and mixture-of-experts (MoE) variants, achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks.

Apple MM1 Model’s Features

In-Context Predictions
The Apple MM1 model excels at making predictions within the context of a given input. By considering the surrounding information, it can generate more accurate and contextually relevant responses. For instance, when presented with a partial sentence or an incomplete query, the MM1 model can intelligently infer the missing parts and provide meaningful answers.

Multi-Image Reasoning
The MM1 model demonstrates impressive capabilities in reasoning across multiple images. It can analyze and synthesize information from various visual inputs, allowing it to make informed decisions based on a broader context.
For example, when evaluating a series of related images (such as frames from a video), the MM1 model can track objects, detect changes, and understand temporal relationships.

Chain-of-Thought Reasoning
One of the standout features of the MM1 model is its ability to maintain a coherent chain of thought. It can follow logical sequences, connect ideas, and provide consistent responses even in complex scenarios. For instance, when engaged in a conversation, the MM1 model remembers previous interactions and ensures continuity by referring back to relevant context.

Few-Shot Learning with Instruction Tuning
The MM1 model leverages few-shot learning techniques, enabling it to learn from a small amount of labeled data. Additionally, it fine-tunes its performance based on specific instructions, adapting efficiently to different tasks. For instance, given only a handful of examples for a new task, the MM1 model can generalize and perform well without extensive training data.

Visual Question Answering (VQA)
The MM1 model can answer questions about visual content through Visual Question Answering (VQA). Given an image and a question, it generates accurate, context-aware answers, demonstrating a robust understanding of visual information. For example, when asked, “What is the color of the car in the picture?”, the MM1 model can analyze the image and provide an appropriate response.

Captioning
When presented with an image, the MM1 model can generate descriptive captions. Its ability to capture relevant details and convey them in natural language makes it valuable for image-captioning tasks. For instance, shown a picture of a serene mountain landscape, the MM1 model might generate a caption like “Snow-capped peaks against a clear blue sky.”

For more information, read the paper published on arXiv by Apple researchers: MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.

Key Components of MM1

Transformer Architecture
The transformer architecture serves as the backbone of MM1:
- Self-Attention Mechanism: Transformers use self-attention to process sequences of data. This mechanism lets them weigh the importance of different elements within a sequence, capturing context and relationships effectively.
- Layer Stacking: Multiple self-attention layers are stacked to create a deep neural network, with each layer refining the representation of the input data.
- Positional Encoding: Transformers incorporate positional information, ensuring they understand the order of elements in a sequence.

Multimodal Pre-Training Data
MM1 benefits from a diverse training dataset:
- Image-Text Pairs: These pairs directly connect visual content (images) with corresponding textual descriptions, teaching the model to associate the two modalities.
- Interleaved Documents: Combining images and text coherently allows MM1 to handle multimodal inputs seamlessly.
- Text-Only Data: Ensures robust language understanding, even when dealing with text alone.

Image Encoder
The image encoder is pivotal to MM1’s performance:
- Feature Extraction: The image encoder processes visual input (images) and extracts relevant features, which serve as the bridge between the visual and textual modalities.
- Resolution and Token Count: Design choices related to image resolution and token count significantly impact MM1’s ability to handle visual information.
Vision-Language Connector
The vision-language connector facilitates communication between textual and visual representations:
- Cross-Modal Interaction: It enables MM1 to align information from both modalities effectively.
- Joint Embeddings: The connector generates joint embeddings that capture shared semantics.

Ablation Study for MLLMs
Building performant Multimodal Large Language Models (MLLMs) is an empirical process that involves carefully exploring design decisions related to architecture, data, and training procedures. Here, the authors present a detailed ablation study conducted to identify optimal configurations for a high-performing model, referred to as MM1. The ablations were performed along three major axes:

MM1 Model Ablations
Different pre-trained image encoders were investigated, along with various methods of connecting Large Language Models (LLMs) to these encoders. The architecture exploration covered the image encoder’s pre-training objective, image resolution, and the design of the vision-language connector.

MM1 Model Ablation

MM1 Data Ablations
Various types of data and their relative mixture weights were considered, including captioned images, interleaved image-text documents, and text-only data. The impact of different data sources on zero-shot and few-shot performance was evaluated across multiple captioning and Visual Question Answering (VQA) tasks.

Data Ablation Study for MM1

Training Procedure Ablations
The training procedure was explored, including hyperparameters and which parts of the model to train at different stages. Two types of losses were considered: contrastive losses (e.g., CLIP-style models) and reconstructive losses (e.g., AIM), with their effects on downstream performance examined.

Empirical Setup
A smaller base configuration of the MM1 model was used for the ablations, allowing efficient assessment of model performance. The base configuration includes an image encoder (a ViT-L/14 model trained with a CLIP loss on the DFN-5B and VeCap-300M datasets), a vision-language connector (a C-Abstractor with 144 image tokens), pre-training data (a mix of captioned images, interleaved image-text documents, and text-only data), and a 1.2B decoder-only transformer language model. Zero-shot and few-shot (4- and 8-shot) performance on various captioning and VQA tasks served as the evaluation metrics.

MM1 Ablation Study: Key Findings
- Image resolution, model size, and training-data composition are the crucial factors affecting model performance.
- The number of visual tokens and the image resolution significantly impact the performance of the vision-language connector, while the type of connector has a minimal effect.
- Interleaved data is crucial for few-shot and text-only performance, while captioning data enhances zero-shot performance.
- Text-only data helps improve few-shot and text-only performance, contributing to better language understanding capabilities.
- A careful mixture of image and text data yields optimal multimodal performance while retaining strong text performance.
- Synthetic caption data (VeCap) provides a notable boost in few-shot learning performance.

Performance Evaluation of MM1 Models
The performance evaluation of MM1 models covers several key aspects: scaling via Mixture-of-Experts (MoE), supervised fine-tuning (SFT) experiments, the impact of image resolution, pre-training effects, and qualitative analysis.
Scaling via Mixture-of-Experts (MoE)
MM1 explores scaling the dense model by adding more experts to the feed-forward network (FFN) layers of the language model. Two MoE models were designed: a 3B-MoE with 64 experts and a 7B-MoE with 32 experts, using top-2 gating and router z-loss terms for training stability. The MoE models demonstrate improved performance over their dense counterparts across various benchmarks, indicating the potential of MoE for further scaling.
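To make the top-2 gating idea concrete, here is a hedged, toy PyTorch sketch of a mixture-of-experts feed-forward layer. It is not Apple's implementation and omits details such as the router z-loss and load balancing.

```python
# A toy top-2 gated mixture-of-experts feed-forward layer.
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, dim=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, dim)
        # Pick the two highest-scoring experts for each token.
        weights, idx = self.router(x).softmax(dim=-1).topk(2, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise top-2
        out = torch.zeros_like(x)
        for k in range(2):  # each token is processed only by its chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

moe = Top2MoE()
print(moe(torch.randn(16, 256)).shape)  # torch.Size([16, 256])
```

Because only two of the eight experts run per token, parameter count grows with the number of experts while per-token compute stays roughly constant, which is what makes MoE attractive for scaling.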
Supervised Fine-Tuning Experiments
Supervised fine-tuning (SFT) is performed on top of the pre-trained MM1 models using a diverse set of datasets, including instruction-response pairs, academic task-oriented vision-language datasets, and text-only data. MM1 models exhibit competitive performance across 12 benchmarks, with particularly strong results on tasks such as VQAv2, TextVQA, and ScienceQA, as well as newer benchmarks like MMMU and MathVista. The models maintain multi-image reasoning capabilities through SFT, enabling few-shot chain-of-thought reasoning.

Impact of Image Resolution
Higher image resolution leads to improved performance, supported by methods such as positional-embedding interpolation and sub-image decomposition. MM1 achieves a relative performance increase of 15% by supporting an image resolution of 1344×1344, compared to a baseline model with an image resolution of 336 pixels.

Pre-Training Effects
Large-scale multimodal pre-training contributes significantly to the model’s performance improvement over time, showcasing the importance of pre-training data quantity. MM1 demonstrates strong in-context few-shot learning and multi-image reasoning capabilities, indicating the effectiveness of large-scale pre-training for enhancing model capabilities.

Qualitative Analysis
Qualitative examples in the evaluation offer further insight into MM1’s capabilities, including single-image and multi-image reasoning as well as few-shot prompting scenarios. These examples highlight the model’s ability to understand and generate contextually relevant responses across various tasks and input modalities.

Apple’s Ethical Guidelines for MM1
- Privacy and Data Security: Apple places utmost importance on user privacy. MM1 models are designed to respect user data and adhere to strict privacy policies; any data used for training is anonymized and aggregated.
- Bias Mitigation: Apple actively works to reduce biases in MM1 models, with rigorous testing and monitoring to identify and rectify biases related to gender, race, or other sensitive attributes.
- Transparency: Apple aims to be transparent about MM1’s capabilities and limitations so users clearly understand how the model works and what it can and cannot do.
- Fairness: MM1 is trained on diverse data, and Apple continues to improve fairness by addressing underrepresented groups and ensuring equitable outcomes.
- Safety and Harm Avoidance: MM1 is designed to avoid harmful or unsafe behavior, refraining from generating content that could cause harm, promote violence, or violate ethical norms.
- Human Oversight: Apple maintains a strong human-in-the-loop approach; MM1 models are continuously monitored, and problematic outputs are flagged for review.

MM1 MLLM: Key Takeaways
- Multimodal Integration: MM1 combines textual and visual information, achieving impressive performance.
- Ablation Study Insights: The image encoder matters, the connector less so; the data mix is crucial.
- Scaling Up MM1: Up to 30 billion parameters, strong pre-training metrics, and competitive fine-tuning results.
- Ethical Guidelines: Privacy, fairness, safety, and human oversight are priorities.