
5 Recent AI Research Papers

September 12, 2023
5 mins

3D Gaussian Splatting for Real-Time Radiance Field Rendering

The paper presents a novel real-time radiance field rendering technique using 3D Gaussian splatting, addressing the challenge of efficient and high-quality rendering.

Objective: Develop a real-time radiance field rendering technique

Problem: Achieving real-time display rates for rendering unbounded and complete scenes at 1080p resolution using Radiance Field methods. Existing approaches often involve costly neural network training and rendering or sacrifice quality for speed, making it difficult to attain both high visual quality and real-time performance for such scenes. 

Solution

  • Anisotropic 3D Gaussians as a high-quality, unstructured representation of radiance fields
  • An optimization technique for 3D Gaussian properties, coupled with adaptive density control, to generate top-tier representations for captured scenes
  • A fast, differentiable, GPU-based rendering approach that is visibility-aware, supports anisotropic splatting, and enables fast backpropagation for high-quality novel-view synthesis.

Methodology 

Scene Representation with 3D Gaussians:

  • Begin with sparse points obtained during camera calibration.
  • Utilize 3D Gaussians to represent the scene (see the covariance sketch after this list).
  • Preserve key characteristics of continuous volumetric radiance fields.
  • Avoid unnecessary computations in empty areas of the scene.
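
To make the representation concrete, here is a minimal NumPy sketch (not the authors' code) of how an anisotropic 3D Gaussian's covariance can be assembled from the per-Gaussian scale and rotation-quaternion parameters, i.e. Σ = R S Sᵀ Rᵀ:

```python
import numpy as np

def quat_to_rotmat(q):
    # Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def anisotropic_covariance(scale, quat):
    # Sigma = R S S^T R^T is positive semi-definite by construction,
    # so the optimizer can work directly on scale and rotation.
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

# Example: a Gaussian stretched along its local x-axis, rotated 45 degrees about z.
sigma = anisotropic_covariance(
    scale=np.array([0.30, 0.05, 0.05]),
    quat=np.array([np.cos(np.pi / 8), 0.0, 0.0, np.sin(np.pi / 8)]),
)
print(sigma)
```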

Optimization and Density Control of 3D Gaussians:

  • Implement interleaved optimization and density control for the 3D Gaussians (a densification sketch follows this list).
  • Focus on optimizing the anisotropic covariance to achieve precise scene representation.
  • Fine-tune Gaussian properties to enhance accuracy.
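
The interleaved density-control step can be pictured with a rough, self-contained sketch; the thresholds and the split factor below are illustrative, and the real implementation runs on the GPU alongside the optimizer:

```python
import numpy as np
from dataclasses import dataclass, replace

@dataclass
class Gaussian:
    position: np.ndarray      # (3,) mean
    scale: np.ndarray         # (3,) per-axis extent
    opacity: float
    view_space_grad: float    # accumulated positional gradient magnitude

def densify_and_prune(gaussians, grad_threshold=0.0002,
                      scale_threshold=0.01, min_opacity=0.005):
    """Schematic density control: prune nearly transparent Gaussians,
    clone small under-reconstructed ones, split large over-reconstructed ones."""
    updated = []
    for g in gaussians:
        if g.opacity < min_opacity:
            continue                      # prune: barely contributes to the image
        if g.view_space_grad > grad_threshold:
            if g.scale.max() < scale_threshold:
                # Under-reconstruction: duplicate the small Gaussian nearby.
                updated += [g, replace(g, position=g.position + 0.5 * g.scale)]
            else:
                # Over-reconstruction: split into two smaller Gaussians.
                offset = 0.5 * g.scale
                updated += [replace(g, scale=g.scale / 1.6, position=g.position + offset),
                            replace(g, scale=g.scale / 1.6, position=g.position - offset)]
        else:
            updated.append(g)
    return updated

cloud = [Gaussian(np.zeros(3), np.array([0.2, 0.02, 0.02]), 0.9, 0.001)]
print(len(densify_and_prune(cloud)))   # 2: the elongated Gaussian was split
```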

Fast Visibility-Aware Rendering Algorithm:

Develop a rapid rendering algorithm designed for GPUs (a per-pixel compositing sketch follows this list):

  • Ensure visibility awareness in the rendering process.
  • Enable anisotropic splatting for improved rendering quality.
  • Accelerate training processes.
  • Facilitate real-time rendering for efficient visualization of the radiance field.
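
As a toy illustration of the visibility-aware blending (plain Python, not the paper's tile-based CUDA rasterizer), each pixel accumulates depth-sorted splats front to back and stops once it is effectively opaque:

```python
def composite_pixel(sorted_splats):
    """sorted_splats: (color, alpha) pairs for one pixel, sorted front to back."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for c, alpha in sorted_splats:
        weight = alpha * transmittance            # contribution of this splat
        color = [acc + weight * ci for acc, ci in zip(color, c)]
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:                  # early termination: pixel is opaque
            break
    return color

print(composite_pixel([([1, 0, 0], 0.6), ([0, 1, 0], 0.5), ([0, 0, 1], 0.9)]))
```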

Find the code implementation on GitHub.
 

Results


3D Gaussian Splatting for Real-Time Radiance Field Rendering 

  • Achieved real-time rendering of complex radiance fields, allowing for interactive and immersive experiences.
  • Demonstrated significant improvements in rendering quality and performance compared to previous methods like InstantNGP and Plenoxels.
  • Showcased the adaptability of the system through dynamic level-of-detail adjustments, maintaining visual fidelity while optimizing resource usage.
  • Validated the effectiveness of 3D Gaussian splatting in handling radiance field rendering challenges.

Read the original paper by Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis: 3D Gaussian Splatting for Real-Time Radiance Field Rendering
 

Nougat: Neural Optical Understanding for Academic Documents

Nougat aims to enhance the accessibility of scientific knowledge stored in digital documents, especially PDFs, by performing OCR on them. It is an academic-document PDF parser that understands LaTeX math and tables.

Objective: Enhance the accessibility of scientific knowledge stored in digital documents, particularly in PDF format. 

Problem: Effectively preserving semantic information, particularly mathematical expressions, while converting PDF-based documents into a machine-readable markup format (LaTeX).

Solution: Nougat is a vision transformer that enables end-to-end training for the task at hand. This architecture builds upon the Donut architecture and does not require any OCR-related inputs or modules, as the text is recognized implicitly by the network.


Nougat: Neural Optical Understanding for Academic Documents

Methodology

Encoder:

  • Receives document image.
  • Crops margins and resizes the image to a fixed rectangle.
  • Utilizes a Swin Transformer, splitting the image into windows and applying self-attention layers.
  • Outputs a sequence of embedded patches.

Decoder:

  • Inputs the encoded image.
  • Uses a transformer decoder architecture with cross-attention.
  • Generates tokens in an auto-regressive manner.
  • Projects the output to match the vocabulary size.

Implementation:

  • Adopts the mBART decoder architecture.
  • Utilizes a specialized tokenizer for scientific text, similar to Galactica’s approach (a schematic encoder-decoder pass is sketched below).
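
The snippet below is a deliberately tiny PyTorch schematic of this encoder-decoder pattern: a stand-in patch encoder, a transformer decoder with cross-attention, and greedy auto-regressive decoding. It only mirrors the shape of the architecture; Nougat's actual Swin encoder and mBART decoder are far larger:

```python
import torch
import torch.nn as nn

class ToyDocumentParser(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, patch=16):
        super().__init__()
        # Stand-in for the Swin encoder: embed non-overlapping image patches.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Stand-in for the mBART-style decoder with cross-attention.
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, tokens):
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, d)
        x = self.token_embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.decoder(x, patches, tgt_mask=mask)   # cross-attends to the image
        return self.lm_head(x)                        # logits over the markup vocabulary

model = ToyDocumentParser()
image = torch.randn(1, 3, 224, 224)
tokens = torch.tensor([[1]])              # start token
for _ in range(5):                        # greedy auto-regressive decoding
    logits = model(image, tokens)
    next_tok = logits[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)
print(tokens)
```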

Find the code implementation on GitHub.
 

Results

  • Mathematical expressions had the lowest agreement with the ground truth, mainly due to missed formulas by GROBID and challenges in equation prediction accuracy stemming from bounding box quality.
  • Nougat, both in its small and base versions, consistently outperformed the alternative approach across all metrics, demonstrating its effectiveness in converting document images to compatible markup text.

Read the original paper from Meta AI by Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic: Nougat: Neural Optical Understanding for Academic Documents.
 

Scaling up GANs for Text-to-Image Synthesis

The paper introduces GigaGAN, a highly scalable GAN-based generative model for text-to-image synthesis, achieving exceptional scale, speed, and controllability compared to previous models.

Objective: Provide a scalable GAN-based alternative to auto-regressive and diffusion models for text-to-image synthesis.

Problem: Making GANs more scalable and efficient in handling large datasets and generating high-quality, high-resolution images while maintaining stability and enabling fine-grained control over the generative process.

Solution: GigaGAN reintroduces GANs for this task with a multi-scale training scheme that improves the alignment between images and text descriptions and enhances the generation of low-frequency details in the output images.

Methodology 

The GigaGAN architecture consists of the following:

Generator:

  • Text encoding branch: uses a pre-trained CLIP model to extract text embeddings, followed by a learned attention layer.
  • Style mapping network: produces a style vector, similar to StyleGAN.
  • Synthesis network: uses the style vector as modulation and the text embeddings as attention to create an image pyramid.
  • Sample-adaptive kernel selection: chooses convolution kernels based on the input text conditioning (sketched after this list).
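
A toy sketch of the sample-adaptive kernel selection idea (an illustration, not the paper's implementation): a bank of candidate convolution kernels is mixed with softmax weights predicted from the text conditioning, and the resulting per-sample kernel is applied as an ordinary convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelConv(nn.Module):
    def __init__(self, in_ch, out_ch, n_kernels=8, k=3, cond_dim=128):
        super().__init__()
        # Bank of candidate kernels; the selector picks a mixture per sample.
        self.bank = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        self.selector = nn.Linear(cond_dim, n_kernels)

    def forward(self, x, cond):
        w = F.softmax(self.selector(cond), dim=-1)                # (B, n_kernels)
        kernel = torch.einsum('bn,noihw->boihw', w, self.bank)    # per-sample kernel
        outs = [F.conv2d(x[i:i + 1], kernel[i], padding=1) for i in range(x.size(0))]
        return torch.cat(outs, dim=0)

layer = AdaptiveKernelConv(in_ch=64, out_ch=64)
feat = torch.randn(2, 64, 32, 32)
text_cond = torch.randn(2, 128)
print(layer(feat, text_cond).shape)   # torch.Size([2, 64, 32, 32])
```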

Discriminator:

  • The image branch of the discriminator makes independent predictions for each scale within the image pyramid (sketched after this list).
  • The text branch handles text in a manner similar to the generator, while the image branch operates on an image pyramid, providing predictions at multiple scales.
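
A minimal sketch of the multi-scale image branch, with a deliberately tiny shared feature extractor standing in for the real discriminator: the same head produces an independent real/fake prediction at every level of a downsampled image pyramid:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, ch=32, n_scales=4):
        super().__init__()
        self.n_scales = n_scales
        self.feat = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.LeakyReLU(0.2))
        self.head = nn.Conv2d(ch, 1, 1)   # per-location real/fake logits

    def forward(self, img):
        preds = []
        for s in range(self.n_scales):
            x = F.avg_pool2d(img, 2 ** s) if s > 0 else img   # pyramid level
            preds.append(self.head(self.feat(x)))             # independent prediction
        return preds

disc = MultiScaleDiscriminator()
logits = disc(torch.randn(1, 3, 64, 64))
print([p.shape for p in logits])
```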

Find the code for evaluation on GitHub.
 

Results

Scale Advancement:

  • GigaGAN is 36 times larger in terms of parameter count than StyleGAN2.
  • It is 6 times larger than StyleGAN-XL and XMC-GAN.

Quality Performance:

  • Despite its impressive scale, GigaGAN does not show quality saturation concerning model size.
  • Achieves a zero-shot FID (Fréchet Inception Distance) of 9.09 on the COCO2014 dataset, which is lower than DALL·E 2, Parti-750M, and Stable Diffusion.

Efficiency:

  • GigaGAN is orders of magnitude faster at image generation, taking only 0.13 seconds to generate a 512px image.

High-Resolution Synthesis:

  • It can synthesize ultra-high-resolution images at 4k resolution in just 3.66 seconds.

Latent Vector Control:

  • GigaGAN offers a controllable latent vector space, enabling various well-studied controllable image synthesis applications, including style mixing, prompt interpolation, and prompt mixing.

Read the original paper by Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park: Scaling up GANs for Text-to-Image Synthesis
 

Code Llama: Open Foundation Models for Code

Code Llama is a cutting-edge code-specialized language model, forged through extended training on code-specific datasets, delivering enhanced coding capabilities and support for a range of programming languages.

Objective: Build a large language model (LLM) that can use text prompts to generate and discuss code.

Problem: Building a specialized language model for code generation and understanding, with a focus on performance, long-context handling, infilling, and instruction following.

Solution: The proposed solution is Code Llama which is available as three variants:

  • Code Llama: foundational code model
  • Code Llama - Python: specialized for Python
  • Code Llama - Instruct: fine-tuned for following natural language instructions

Methodology

  • Code Llama is a specialized model built upon Llama 2.
  • It was developed by extended training on code-specific datasets, including increased data sampling and longer training (a usage sketch follows this list).
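
As a usage sketch (assuming the Hugging Face transformers library; the checkpoint name below is illustrative, so substitute whichever Code Llama variant and size you have access to), text-prompted code generation looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"   # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Prompt the model to complete a Python function from its signature and docstring.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```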

Find the code for implementation on GitHub.
 

Results

  • Code Llama achieves state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% on HumanEval and up to 55% on MBPP.
  • Code Llama - Python 7B outperforms Llama 2 70B on both HumanEval and MBPP benchmarks.
  • All variants of Code Llama models outperform every other publicly available model on the MultiPL-E benchmark.

Read the original paper by Meta AI: Code Llama: Open Foundation Models for Code.
 

FaceChain: A Playground for Identity-Preserving Portrait Generation

FaceChain is a personalized portrait generation framework that combines advanced LoRA models and perceptual understanding techniques to create your digital twin.

Objective: Build a personalized portrait generation framework that generates portraits from a limited set of input images.

Problem: The limitations of existing personalized image generation solutions, including the inability to accurately capture key identity characteristics and the presence of defects like warping, blurring, or corruption in the generated images.

Solution: FaceChain is a framework designed to preserve the unique characteristics of faces while offering versatile control over stylistic elements in image generation.

FaceChain integrates two LoRA models into the Stable Diffusion model. This integration endows the model with the capability to simultaneously incorporate personalized style and identity information, addressing a critical challenge in image generation.

Methodology

  • Integration of LoRA Models: FaceChain incorporates LoRA models to improve the stability of stylistic elements and maintain consistency in preserving identity during text-to-image generation.
  • Style and Identity Learning: Two LoRA models are used, namely the style-LoRA model and face-LoRA model. The style-LoRA model focuses on learning information related to portrait style, while the face-LoRA model focuses on preserving human identities.
  • Separate Training: These two models are trained separately. The style-LoRA model is trained offline, while the face-LoRA model is trained online using user-uploaded images of the same human identity.
  • Quality Control: To ensure the quality of input images for training the face-LoRA model, FaceChain employs a set of face-related perceptual understanding models. These models normalize the uploaded images, ensuring they meet specific quality standards such as appropriate size, good skin quality, correct orientation, and accurate tags.
  • Weighted Model Integration: During inference, the weights of multiple LoRA models are merged into the Stable Diffusion model to generate personalized portraits (see the sketch after this list).
  • Post-Processing: The generated portraits undergo a series of post-processing steps to further enhance their details and overall quality.
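
The weighted integration step can be pictured with a small sketch (an assumed linear-layer parameterization, not FaceChain's actual code): each LoRA contributes a low-rank update B·A that is scaled by its merge weight and added to the frozen Stable Diffusion weight:

```python
import torch

def merge_loras(base_weight, loras, weights):
    """base_weight: (out, in) frozen base weight matrix.
    loras: list of (A, B) pairs with A: (rank, in) and B: (out, rank).
    weights: per-LoRA merge coefficients (e.g. style vs. face)."""
    merged = base_weight.clone()
    for (A, B), w in zip(loras, weights):
        merged += w * (B @ A)   # low-rank update, scaled by its merge weight
    return merged

out_dim, in_dim, rank = 320, 320, 4
base = torch.randn(out_dim, in_dim)
style_lora = (torch.randn(rank, in_dim) * 0.01, torch.randn(out_dim, rank) * 0.01)
face_lora = (torch.randn(rank, in_dim) * 0.01, torch.randn(out_dim, rank) * 0.01)
# The merge coefficients here are illustrative, not FaceChain's values.
merged = merge_loras(base, [style_lora, face_lora], weights=[0.6, 0.8])
print(merged.shape)
```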

Find the code implementation on GitHub.
 

Results


FaceChain: A Playground for Identity-Preserving Portrait Generation

 

Written by Akruti Acharya