6 May 2023

vk01.E - Hello Frame Buffering

by P. A. Minerva




1 - Introduction

In previous tutorials, we used the CPU and GPU in a sequential, suboptimal way. Despite having multiple command buffers and images in the swapchain, we only created one frame at a time on the CPU timeline and waited for the GPU to complete it before creating the next frame. Now, we can improve performance and efficiency by unleashing parallelism between the CPU and GPU. This means that the CPU can create frames in advance while the GPU renders frames from its queues, reducing idle time for both the CPU and GPU.



2 - CPU-GPU parallelism and frame resources

To create frames in advance on the CPU timeline, we need to preserve the resources of pending frames. For example, we usually have multiple images in a swapchain that can be used as render targets by different command buffers. As a result, we can record and submit more than one command buffer at a time on the CPU timeline without interfering with the command buffers and images of frames already queued for processing. Additionally, we need multiple copies of every resource (buffers, descriptors, synchronization objects, etc.) that our application writes to while creating a frame. This way, frame creation won't interfere with other pending frames in GPU queues. That is, the resources that will be used on the GPU timeline to render pending frames are left untouched while a new frame is created on the CPU timeline.

However, observe that we cannot simply record command buffers indefinitely. Eventually, all available command buffers could be queued and we would require a synchronization mechanism to wait on the CPU timeline for a frame to be completed by the GPU. This ensures that we can reuse all resources associated with that frame. This is where synchronization objects, such as semaphores and fences, come in handy. Once the GPU completes rendering on a frame, the corresponding synchronization object will be signaled, allowing the application to know when it is safe to reuse the corresponding resources to create a new frame on the CPU timeline.

Suppose we have two command buffers and two swapchain images, which we'll call A and B. After recording the commands that create the first frame on A, we submit the corresponding command buffer to a GPU queue. At that point, we can immediately start recording the commands that create the second frame on B in the other command buffer and submit the result to a GPU queue. Then, we need to wait for the GPU to complete its rendering operations on A before we can be sure that the commands in the first command buffer have been executed. By waiting on a fence or semaphore signaled for this purpose, we know when the first command buffer can safely be reused on the CPU timeline for recording new commands - and the same holds for all the other frame resources.
Additional details will be provided in the last section of this tutorial, when we review the source code of the sample.
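
To make the idea concrete, here is a minimal, illustrative sketch of such a loop. Note that appIsRunning, RecordCommands, and SubmitToQueue are placeholders, not functions of the sample; the actual implementation is reviewed in the last section.


// Illustrative sketch of a frame-buffered render loop (placeholder names, not the sample's code).
const uint32_t kFramesInFlight = 2;
uint32_t frameIndex = 0;

while (appIsRunning)
{
    // Wait until the GPU has finished the frame that last used this slot's resources.
    // The first kFramesInFlight iterations don't block because the fences start signaled.
    vkWaitForFences(device, 1, &fences[frameIndex], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &fences[frameIndex]);

    // Record this slot's command buffer and submit it; the GPU signals the fence
    // when the submission completes.
    RecordCommands(commandBuffers[frameIndex]);
    SubmitToQueue(commandBuffers[frameIndex], fences[frameIndex]);

    // Move on to the next slot without waiting for the GPU to finish this frame.
    frameIndex = (frameIndex + 1) % kFramesInFlight;
}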



3 - Present latency

While it may seem advantageous to create as many frames as possible in advance on the CPU timeline, there is a downside to this approach. Specifically, creating too many frames in advance can increase the present or frame latency - the time it takes for a frame to be displayed on the screen after it has been created on the CPU timeline. This means that there is a trade-off between reducing idle time on the CPU and GPU by creating frames in advance and minimizing frame latency to ensure a responsive and smooth user experience. It is important to carefully balance these two factors to achieve optimal performance and efficiency in our applications.
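
As a rough example, assume the GPU takes about 16.7 ms to render a frame (60 FPS). If we allowed three frames to be queued ahead, input sampled while recording a frame might not reach the screen for roughly 3 × 16.7 ≈ 50 ms, plus whatever time the presentation engine holds the image. With the two-frame setup used below, the worst case is closer to 33 ms.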



4 - VKHelloFrameBuffering: Code review

First of all, we need an array for each resource type that can be modified during the creation of a frame.


struct FrameResources {
    std::vector<VkCommandBuffer>         GraphicsCommandBuffers;
    std::vector<BufferParameters>        HostVisibleBuffers;
    std::vector<VkDescriptorSet>         DescriptorSets;
    std::vector<VkSemaphore>             ImageAvailableSemaphores;
    std::vector<VkSemaphore>             RenderingFinishedSemaphores;
    std::vector<VkFence>                 Fences;
};
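
For reference, BufferParameters is a helper type from this tutorial series' framework. Judging from how it is used later in this section, it presumably groups at least the following members; this is a sketch inferred from usage, not the framework's exact definition.


// Sketch of BufferParameters, inferred from its usage in this sample.
// The framework's actual definition may include additional members.
struct BufferParameters {
    VkBuffer               Handle;        // Buffer object
    VkDeviceMemory         Memory;        // Backing device memory
    void*                  MappedMemory;  // Persistently mapped pointer (host-visible memory)
    VkDescriptorBufferInfo Descriptor;    // Used to write the uniform buffer descriptor later
};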


In the VKSample class, we define an m_frameIndex variable to index into the above vectors in order to access the resources of the current frame - the frame we are creating on the CPU timeline. MAX_FRAME_LAG defines the maximum number of frames we want to queue before waiting on the CPU timeline - that is, the maximum number of pending frames. Typically, queuing one frame ahead of the GPU is sufficient to minimize present latency while reducing idle time for both the GPU and CPU.


// Max number of frames to queue
#define MAX_FRAME_LAG 2


class VKSample
{
public:

    // ...


protected:

    // ...

    virtual void CreateSynchronizationObjects();
    virtual void AllocateCommandBuffers();

    // ...

    // Index of the current frame
    uint32_t m_frameIndex = 0;


    // ...

};


CreateSynchronizationObjects and AllocateCommandBuffers need significant changes compared to the previous tutorial because we need to create/allocate one resource per type (command buffer, fence, wait and signal semaphores) for each of the MAX_FRAME_LAG frames we can queue on the CPU timeline. Observe that the fences are created in a signaled state; the reason for this will become clear shortly.


void VKSample::AllocateCommandBuffers()
{
    if (!m_sampleParams.GraphicsCommandPool)
    {
        VkCommandPoolCreateInfo cmdPoolInfo = {};
        cmdPoolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
        cmdPoolInfo.queueFamilyIndex = m_vulkanParams.GraphicsQueue.FamilyIndex;
        cmdPoolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
        VK_CHECK_RESULT(vkCreateCommandPool(m_vulkanParams.Device, &cmdPoolInfo, nullptr, &m_sampleParams.GraphicsCommandPool));
    }

    // Allocate one command buffer for each frame we can queue (MAX_FRAME_LAG)
    m_sampleParams.FrameRes.GraphicsCommandBuffers.resize(MAX_FRAME_LAG);

    VkCommandBufferAllocateInfo commandBufferAllocateInfo{};
    commandBufferAllocateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    commandBufferAllocateInfo.commandPool = m_sampleParams.GraphicsCommandPool;
    commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    commandBufferAllocateInfo.commandBufferCount = static_cast<uint32_t>(MAX_FRAME_LAG);

    VK_CHECK_RESULT(vkAllocateCommandBuffers(m_vulkanParams.Device, &commandBufferAllocateInfo, m_sampleParams.FrameRes.GraphicsCommandBuffers.data()));
}

void VKSample::CreateSynchronizationObjects()
{
    m_sampleParams.FrameRes.ImageAvailableSemaphores.resize(MAX_FRAME_LAG);
    m_sampleParams.FrameRes.RenderingFinishedSemaphores.resize(MAX_FRAME_LAG);
    m_sampleParams.FrameRes.Fences.resize(MAX_FRAME_LAG);

    // Create semaphores to synchronize acquiring presentable images before rendering and 
    // waiting for drawing to be complete before presenting
    VkSemaphoreCreateInfo semaphoreCreateInfo = {};
    semaphoreCreateInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
    semaphoreCreateInfo.pNext = nullptr;

    // Create fences to synchronize CPU and GPU timelines.
    VkFenceCreateInfo fenceCreateInfo{};
    fenceCreateInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    fenceCreateInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;

    for (size_t i = 0; i < MAX_FRAME_LAG; i++)
    {
        // Create an unsignaled semaphore
        VK_CHECK_RESULT(vkCreateSemaphore(m_vulkanParams.Device, &semaphoreCreateInfo, nullptr, &m_sampleParams.FrameRes.ImageAvailableSemaphores[i]));

        // Create an unsignaled semaphore
        VK_CHECK_RESULT(vkCreateSemaphore(m_vulkanParams.Device, &semaphoreCreateInfo, nullptr, &m_sampleParams.FrameRes.RenderingFinishedSemaphores[i]));

        // Create a signaled fence
        VK_CHECK_RESULT(vkCreateFence(m_vulkanParams.Device, &fenceCreateInfo, nullptr, &m_sampleParams.FrameRes.Fences[i]));
    }
}
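
Naturally, these per-frame objects must also be destroyed when the sample shuts down. The following is a minimal sketch of the matching cleanup, assuming the device has been idled (e.g., with vkDeviceWaitIdle) beforehand; the sample's actual cleanup code may differ.


// Sketch of the matching cleanup (assumes the device is idle).
for (size_t i = 0; i < MAX_FRAME_LAG; i++)
{
    vkDestroySemaphore(m_vulkanParams.Device, m_sampleParams.FrameRes.ImageAvailableSemaphores[i], nullptr);
    vkDestroySemaphore(m_vulkanParams.Device, m_sampleParams.FrameRes.RenderingFinishedSemaphores[i], nullptr);
    vkDestroyFence(m_vulkanParams.Device, m_sampleParams.FrameRes.Fences[i], nullptr);
}

// Destroying the command pool frees all command buffers allocated from it.
vkDestroyCommandPool(m_vulkanParams.Device, m_sampleParams.GraphicsCommandPool, nullptr);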


In the VKHelloFrameBuffering class, we create as many host-visible buffers as frames we can queue, because buffer data needs to be updated on a per-frame basis. Therefore, we need to preserve this resource for pending frames.


void VKHelloFrameBuffering::CreateHostVisibleBuffers()
{
    //
    // Create buffers in host-visible device memory
    // since they need to be updated from the CPU on a per-frame basis.
    //
    
    // Used to request an allocation of a specific size from a certain memory type.
    VkMemoryAllocateInfo memAlloc = {};
    memAlloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    VkMemoryRequirements memReqs;
    
    // Create the buffer object
    VkBufferCreateInfo bufferInfo = {};
    bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    bufferInfo.size = sizeof(uBufVS);
    bufferInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;

    m_sampleParams.FrameRes.HostVisibleBuffers.resize(MAX_FRAME_LAG);
    for (size_t i = 0; i < MAX_FRAME_LAG; i++)
    {
        VK_CHECK_RESULT(vkCreateBuffer(m_vulkanParams.Device, &bufferInfo, nullptr, &m_sampleParams.FrameRes.HostVisibleBuffers[i].Handle));

        // Request a memory allocation from coherent, host-visible device memory that is large 
        // enough to hold the buffer.
        // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT makes sure writes performed by the host (application)
        // will be directly visible to the device without requiring the explicit flushing of cached memory.
        vkGetBufferMemoryRequirements(m_vulkanParams.Device, m_sampleParams.FrameRes.HostVisibleBuffers[i].Handle, &memReqs);
        memAlloc.allocationSize = memReqs.size;
        memAlloc.memoryTypeIndex = GetMemoryTypeIndex(memReqs.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, m_deviceMemoryProperties);
        VK_CHECK_RESULT(vkAllocateMemory(m_vulkanParams.Device, &memAlloc, nullptr, &m_sampleParams.FrameRes.HostVisibleBuffers[i].Memory));

        // Map the host-visible device memory just allocated.
        // Leave it mapped so that we don't have to map and unmap it every time we want to update the buffer data.
        VK_CHECK_RESULT(vkMapMemory(m_vulkanParams.Device, 
                                    m_sampleParams.FrameRes.HostVisibleBuffers[i].Memory, 
                                    0, memAlloc.allocationSize, 
                                    0, &m_sampleParams.FrameRes.HostVisibleBuffers[i].MappedMemory));

        // Bind the buffer object to the backing host-visible device memory just allocated.
        VK_CHECK_RESULT(vkBindBufferMemory(m_vulkanParams.Device, 
                                           m_sampleParams.FrameRes.HostVisibleBuffers[i].Handle, 
                                           m_sampleParams.FrameRes.HostVisibleBuffers[i].Memory, 0));

        // Store information needed to write/update the corresponding descriptor (uniform buffer) in the descriptor set later.
        m_sampleParams.FrameRes.HostVisibleBuffers[i].Descriptor.buffer = m_sampleParams.FrameRes.HostVisibleBuffers[i].Handle;
        m_sampleParams.FrameRes.HostVisibleBuffers[i].Descriptor.offset = 0;
        m_sampleParams.FrameRes.HostVisibleBuffers[i].Descriptor.range = sizeof(uBufVS);
    }
}
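
Note that the comment about coherent memory above matters: because we request VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, host writes through the mapped pointer become visible to the GPU without explicit flushing. If we ever fell back to a host-visible but non-coherent memory type, each per-frame update would need an explicit flush after the memcpy, along the lines of this sketch (not needed in this sample):


// Sketch: explicit flush required only for host-visible, NON-coherent memory.
VkMappedMemoryRange flushRange = {};
flushRange.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
flushRange.memory = m_sampleParams.FrameRes.HostVisibleBuffers[m_frameIndex].Memory;
flushRange.offset = 0;
flushRange.size   = VK_WHOLE_SIZE; // avoids nonCoherentAtomSize alignment concerns

VK_CHECK_RESULT(vkFlushMappedMemoryRanges(m_vulkanParams.Device, 1, &flushRange));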


This also applies to descriptor sets, which can be updated by writing descriptors during frame creation.


void VKHelloFrameBuffering::CreateDescriptorPool()
{
    //
    // To calculate the amount of memory required for a descriptor pool, the implementation needs to know
    // the max numbers of descriptor sets we will request from the pool, and the number of descriptors 
    // per type we will include in those descriptor sets.
    //

    // Describe the number of descriptors per type.
    // This sample uses one descriptor type (uniform buffer) and requests MAX_FRAME_LAG descriptors 
    // of this type (one for each of the MAX_FRAME_LAG descriptor sets we will use to preserve frame resources)
    VkDescriptorPoolSize typeCounts[1];
    typeCounts[0].type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
    typeCounts[0].descriptorCount = static_cast<uint32_t>(MAX_FRAME_LAG);
    // For additional descriptor types, add new entries to the type count list.
    // E.g., for two combined image samplers:
    // typeCounts[1].type = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
    // typeCounts[1].descriptorCount = 2;

    // Create a global descriptor pool
    // All descriptor sets used in this sample will be allocated from this pool
    VkDescriptorPoolCreateInfo descriptorPoolInfo = {};
    descriptorPoolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
    descriptorPoolInfo.pNext = nullptr;
    descriptorPoolInfo.poolSizeCount = 1;
    descriptorPoolInfo.pPoolSizes = typeCounts;
    // Set the max. number of descriptor sets that can be requested from this pool (requesting beyond this limit will result in an error)
    descriptorPoolInfo.maxSets = static_cast<uint32_t>(MAX_FRAME_LAG);

    VK_CHECK_RESULT(vkCreateDescriptorPool(m_vulkanParams.Device, &descriptorPoolInfo, nullptr, &m_sampleParams.DescriptorPool));
}


AllocateDescriptorSets also takes into account the need for multiple descriptor sets. Since all descriptor sets share a common descriptor set layout, we build an array whose MAX_FRAME_LAG elements all refer to the same descriptor set layout. In the for loop, we write a uniform buffer descriptor into each descriptor set. Each of these descriptors points to a different buffer in host-visible device memory, allowing each frame to use its own buffer.


void VKHelloFrameBuffering::AllocateDescriptorSets()
{
    // Allocate MAX_FRAME_LAG descriptor sets from the global descriptor pool.
    // Use the descriptor set layout to calculate the amount of memory required to store the descriptor sets.
    VkDescriptorSetAllocateInfo allocInfo = {};
    allocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
    allocInfo.descriptorPool = m_sampleParams.DescriptorPool;
    allocInfo.descriptorSetCount = static_cast<uint32_t>(MAX_FRAME_LAG);
    std::vector<VkDescriptorSetLayout> DescriptorSetLayouts(MAX_FRAME_LAG, m_sampleParams.DescriptorSetLayout);
    allocInfo.pSetLayouts = DescriptorSetLayouts.data();

    m_sampleParams.FrameRes.DescriptorSets.resize(MAX_FRAME_LAG);
    VK_CHECK_RESULT(vkAllocateDescriptorSets(m_vulkanParams.Device, &allocInfo, m_sampleParams.FrameRes.DescriptorSets.data()));

    //
    // Write the descriptors updating the corresponding descriptor sets.
    // For every binding point used in a shader code there needs to be at least a descriptor 
    // in a descriptor set matching that binding point.
    //

    VkWriteDescriptorSet writeDescriptorSet = {};

    for (size_t i = 0; i < MAX_FRAME_LAG; i++)
    {
        // Write the descriptor of the uniform buffer.
        // We need to pass the descriptor set where it is stored and 
        // the binding point associated with the descriptor in the descriptor set.
        writeDescriptorSet.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writeDescriptorSet.dstSet = m_sampleParams.FrameRes.DescriptorSets[i];
        writeDescriptorSet.descriptorCount = 1;
        writeDescriptorSet.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
        writeDescriptorSet.pBufferInfo = &m_sampleParams.FrameRes.HostVisibleBuffers[i].Descriptor;
        writeDescriptorSet.dstBinding = 0;

        vkUpdateDescriptorSets(m_vulkanParams.Device, 1, &writeDescriptorSet, 0, nullptr);
    }
}


As we know, in the render loop we call both OnUpdate and OnRender. In this case, OnUpdate still calls UpdateHostVisibleBufferData as in the previous tutorial, but now it uses the m_frameIndex variable to select the host-visible buffer associated with the current frame.


void VKHelloFrameBuffering::UpdateHostVisibleBufferData()
{
    const float translationSpeed = 0.8f;    // speed
    const float offsetBounds = 0.75f;       // bound the displacement within the range [-1, +1]; see CreateVertexBuffer
    static float direction = 1.0f;          // direction
 
    // From speed = (displacement / time) you can derive that: displacement = (speed * time)
    uBufVS.displacement[0] += direction * translationSpeed * m_timer.GetElapsedSeconds();
    if (uBufVS.displacement[0] > offsetBounds)
    {
        uBufVS.displacement[0] = offsetBounds;
        direction *= -1.0f;
    }
    else if (uBufVS.displacement[0] < -offsetBounds)
    {
        uBufVS.displacement[0] = -offsetBounds;
        direction *= -1.0f;
    }

    // Update uniform buffer data
    // Note: Since we requested a host coherent memory type for the uniform buffer, the write is instantly visible to the GPU
    memcpy(m_sampleParams.FrameRes.HostVisibleBuffers[m_frameIndex].MappedMemory, &uBufVS, sizeof(uBufVS));
}


This also applies to all member functions of VKHelloFrameBuffering that access frame resources. For more details, refer to the complete source code of the sample.


// Render the scene.
void VKHelloFrameBuffering::OnRender()
{
    // Ensure no more than MAX_FRAME_LAG frames are queued.
    VK_CHECK_RESULT(vkWaitForFences(m_vulkanParams.Device, 1, &m_sampleParams.FrameRes.Fences[m_frameIndex], VK_TRUE, UINT64_MAX));
    VK_CHECK_RESULT(vkResetFences(m_vulkanParams.Device, 1, &m_sampleParams.FrameRes.Fences[m_frameIndex]));

    // Get the index of the next available image in the swap chain
    uint32_t imageIndex;
    VkResult acquire = vkAcquireNextImageKHR(m_vulkanParams.Device, 
                                             m_vulkanParams.SwapChain.Handle, 
                                             UINT64_MAX, 
                                             m_sampleParams.FrameRes.ImageAvailableSemaphores[m_frameIndex], 
                                             nullptr, &imageIndex);
    if (!((acquire == VK_SUCCESS) || (acquire == VK_SUBOPTIMAL_KHR)))
    {
        if (acquire == VK_ERROR_OUT_OF_DATE_KHR)
            WindowResize(m_width, m_height);
        else
            VK_CHECK_RESULT(acquire);
    }

    PopulateCommandBuffer(imageIndex);

    SubmitCommandBuffer();

    PresentImage(imageIndex);

    // WAITING FOR THE GPU TO COMPLETE THE FRAME BEFORE CONTINUING IS NOT BEST PRACTICE.
    // vkQueueWaitIdle is used for simplicity.
    // (so that we can reuse the command buffer indexed with m_frameIndex)
    //VK_CHECK_RESULT(vkQueueWaitIdle(m_vulkanParams.GraphicsQueue.Handle));

    // Update command buffer index
    m_frameIndex = (m_frameIndex + 1) % MAX_FRAME_LAG;
}


At the beginning of OnRender, vkWaitForFences is called to prevent queuing more frames than expected. This is possible because after populating the command buffer for the current frame, we call SubmitCommandBuffer which in turn invokes vkQueueSubmit passing a fence object as its last parameter. This will append a fence operation at the end of the submission batches specified in vkQueueSubmit, so that the fence will be signaled once the execution of all these submission batches is complete.

vkWaitForFences is used to wait for one or more fences to be signaled before proceeding. The last two parameters specify whether to wait for all or any fences to be signaled and how long to wait before returning.
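
For example, to wait for all pending frames at once (say, before destroying frame resources), we could pass the whole fence array with VK_TRUE for the waitAll parameter:


// Sketch: wait until ALL per-frame fences are signaled (waitAll = VK_TRUE),
// with a timeout of UINT64_MAX (effectively no timeout).
VK_CHECK_RESULT(vkWaitForFences(m_vulkanParams.Device,
                                static_cast<uint32_t>(MAX_FRAME_LAG),
                                m_sampleParams.FrameRes.Fences.data(),
                                VK_TRUE,
                                UINT64_MAX));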

Observe that the first MAX_FRAME_LAG times OnRender is called, we don't need to synchronize frame resources because we have separate resources for each of the MAX_FRAME_LAG frames we can queue. In other words, we can queue the first MAX_FRAME_LAG frames without waiting on the CPU timeline for vkWaitForFences to return, as there is no chance that frame resources can be overwritten. That's why we create the fences in a signaled state: the first wait on each fence returns immediately. However, unlike semaphores, which are automatically reset after a wait operation completes, fences must be manually reset with vkResetFences.

At the end of OnRender, we don’t need to wait for the GPU queue to become idle by calling vkQueueWaitIdle anymore. Instead, we only need to increase the value of the m_frameIndex variable to select the resources associated with the next frame during the following iteration of the rendering loop.


void VKHelloFrameBuffering::SubmitCommandBuffer()
{
    // Pipeline stage at which the queue submission will wait (via pWaitSemaphores)
    VkPipelineStageFlags waitStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    // The submit info structure specifies a command buffer queue submission batch
    VkSubmitInfo submitInfo = {};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.pWaitDstStageMask = &waitStageMask;                                                // Pointer to the list of pipeline stages that the semaphore waits will occur at
    submitInfo.waitSemaphoreCount = 1;                                                            // One wait semaphore
    submitInfo.signalSemaphoreCount = 1;                                                          // One signal semaphore
    submitInfo.pCommandBuffers = &m_sampleParams.FrameRes.GraphicsCommandBuffers[m_frameIndex];   // Command buffer(s) to execute in this batch (submission)
    submitInfo.commandBufferCount = 1;                                                            // One command buffer

    submitInfo.pWaitSemaphores = &m_sampleParams.FrameRes.ImageAvailableSemaphores[m_frameIndex];        // Semaphore(s) to wait upon before the submitted command buffers start executing
    submitInfo.pSignalSemaphores = &m_sampleParams.FrameRes.RenderingFinishedSemaphores[m_frameIndex];   // Semaphore(s) to be signaled when command buffers have completed

    VK_CHECK_RESULT(vkQueueSubmit(m_vulkanParams.GraphicsQueue.Handle, 1, &submitInfo, m_sampleParams.FrameRes.Fences[m_frameIndex]));
}


Please note that we could also have waited at the beginning of OnRender for RenderingFinishedSemaphores to be signaled. However, it is important to understand the difference between the two approaches: the semaphores assigned to the pSignalSemaphores member of VkSubmitInfo are signaled when the execution of the corresponding submission batch is complete, while waiting with vkWaitForFences on the fence passed to vkQueueSubmit guarantees that all the submission batches in that submission have completed before we continue creating frames on the CPU timeline.
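
PresentImage is not reviewed here, but presumably it looks much like in the previous tutorials: the presentation request waits on the current frame's RenderingFinishedSemaphores so the image is not displayed before rendering completes. A sketch under that assumption:


// Sketch of PresentImage: vkQueuePresentKHR waits on the semaphore signaled
// by SubmitCommandBuffer before the image can be presented.
void VKHelloFrameBuffering::PresentImage(uint32_t imageIndex)
{
    VkPresentInfoKHR presentInfo = {};
    presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    presentInfo.waitSemaphoreCount = 1;
    presentInfo.pWaitSemaphores = &m_sampleParams.FrameRes.RenderingFinishedSemaphores[m_frameIndex];
    presentInfo.swapchainCount = 1;
    presentInfo.pSwapchains = &m_vulkanParams.SwapChain.Handle;
    presentInfo.pImageIndices = &imageIndex;

    VkResult present = vkQueuePresentKHR(m_vulkanParams.GraphicsQueue.Handle, &presentInfo);
    if (!((present == VK_SUCCESS) || (present == VK_SUBOPTIMAL_KHR)))
    {
        if (present == VK_ERROR_OUT_OF_DATE_KHR)
            WindowResize(m_width, m_height);
        else
            VK_CHECK_RESULT(present);
    }
}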



Source code: LearnVulkan




If you found the content of this tutorial somewhat useful or interesting, please consider supporting this project by clicking on the Sponsor button. Whether a small tip, a one time donation, or a recurring payment, it’s all welcome! Thank you!
