15 November 2023

vk02.G - Compute Particles

by P. A. Minerva

1 - Introduction

In a previous tutorial (02.D - Transform Feedback), we explored the transform feedback as a stage of the graphics pipeline. Despite its utility, we highlighted its discouraged use due to legacy nature and potential performance issues. This bonus tutorial aims to demonstrate a more straightforward implementation of the same sample from that tutorial by leveraging the compute shader of the compute pipeline instead of the transform feedback.

[!IMPORTANT]
Switching between graphics and compute pipelines can result in a performance cost. This implies that, especially in scenarios with minimal compute work, opting for a compute shader over enabling transform feedback (which is part of the graphics pipeline) may not consistently lead to improved performance.
However, performance enhancement can often be facilited through async compute. This involves recording compute and graphics work asynchronously and queuing the corresponding command buffers in different GPU queues, if available. Indeed, this approach relies on the availability of multiple queues on the device capable of handling various command types (graphics, compute, etc.), a feature often lacking, especially on integrated and mobile GPUs.

To implement the simple particle effect shown in tutorial 02.D using transform feedback, the following steps were performed:

Check Device Support: Verify if the device supports transform feedback.
Enable Transform Feedback: Inform the device driver of our intention to use this optional stage of the graphics pipeline.
Retrieve Extension Functions: Obtain addresses of extension functions supporting transform feedback (e.g., vkCmdBeginTransformFeedbackEXT).
Create Buffers: Generate transform feedback and counter buffers.
Vertex Buffer Creation: Generate a separate vertex buffer to store particle attributes (position, size, and speed) for the first frame.
Pipeline Setup: Create two graphics pipeline objects — one for capturing vertex attributes and the other for rendering.
Shader Configuration: Set the vertex shader for the capturing step and the programmable stages for the rendering step.
Activate Transform Feedback: Activate the capturing mode for the transform feedback buffer bound to the command buffer.
Update and Capture Attributes: update particle positions in the vertex shader of the capturing step, and capture vertex attributes by putting them into the transform feedback buffer.
Deactivate Transform Feedback: Deactivate the capturing mode for the transform feedback buffer bound to the command buffer, and store the current byte position in the counter buffer.
Use the Transform Feedback Buffer: Use the transform feedback buffer as the vertex buffer for the rendering step to create a new frame.
Draw Indirectly: Call vkCmdDrawIndirectByteCountEXT, which allows to draw by determining the vertex count from the byte count stored in the counter buffer.
Synchronize Buffers: Synchronize both the transform feedback and counter buffers.

In contrast, using a compute shader, we can achieve the same particle effect with a more straightforward approach:

Storage Buffer Creation: Create a storage buffer for each frame in flight to store particle attributes.
Pipeline Setup: Create both compute and graphics pipeline objects.
Compute Shader Logic: Set the compute shader to update particle positions using old positions stored in the storage buffer from the previous frame.
Graphics Pipeline Stages Setup: Set the programmable stages of the graphics pipeline for drawing quads from particles (using a geometry shader).
Buffer Synchronization: Synchronize graphics and compute works for the current storage buffer only.

Building on knowledge acquired from the previous tutorial (02.F - Compute Shader), let’s see how we can re-implement the sample presented in 02.D - Transform Feedback by using a compute shader.

2 - VKComputeParticles: code review

2.1 - C++ code

Let’s start by examining the definition of the VKComputeParticles class.

class VKComputeParticles : public VKSample
{
public:


    // ...


private:
    

    // ...


    // Buffer creation
    void CreateStagingBuffer();                  // Create a staging buffer
    void CreateStorageBuffers();                 // Create storage buffers

    // Compute setup and operations
    void PrepareCompute();
    void PopulateComputeCommandBuffer();
    void SubmitComputeCommandBuffer();


    // For simplicity we use the same uniform block layout used in shader code:
    //
    // layout(std140, set = 0, binding = 0) uniform buf {
    //     mat4 viewMatrix;
    //     mat4 projMatrix;
    //     vec3 cameraPos;
    //     float deltaTime;
    // } uBuf;
    //
    // This way we can just memcopy the uBufVS data to match the uBuf memory layout.
    // Note: You should use data types that align with the GPU in order to avoid manual padding (vec4, mat4)
    struct {
        glm::mat4 viewMatrix;         // 64 bytes
        glm::mat4 projectionMatrix;   // 64 bytes
        glm::vec3 cameraPos;          // 12 bytes
        float     deltaTime;          // 4 bytes
    } uBufVS;

    // Uniform block defined in the shader code to be used as a dynamic uniform buffer:
    //
    //layout(std140, set = 0, binding = 1) uniform dynbuf {
    //     mat4 worlddMatrix;
    //     vec4 solidColor;
    // } dynBuf;
    //
    // Allow the specification of different world matrices for different objects by offsetting
    // into the same buffer.
    struct MeshInfo{
        glm::mat4 worldMatrix;
        glm::vec4 solidColor;
    };

    struct {
        MeshInfo *meshInfo;  // pointer to an array of mesh info
    } dynUBufVS;
    
    // Vertex layout used in this sample (stride: 32 bytes)
    // We will store multiple vertices countiguously in storage buffers used both as 
    // vertex buffer and uniform buffer, so we need to pad.
    struct Vertex {
        glm::vec3 position;
        float pad_1;
        glm::vec2 size;
        float speed;
        float pad_2;
    };

    // Mesh object info
    struct MeshObject
    {
        uint32_t dynIndex;
        uint32_t indexCount;
        uint32_t firstIndex;
        uint32_t vertexOffset;
        uint32_t vertexCount;
        MeshInfo *meshInfo;
    };

    // Storage buffer info and data
    struct StorageBuf {
        BufferParameters StorageBuffer;

        // Buffer lenght and element size
        static const uint32_t BufferLenght = 81; // num. of particles
        static const uint32_t BufferElementSize = sizeof(Vertex);  // byte size of each particle
    };

    // In this sample we have a single draw call.
    const unsigned int m_numDrawCalls = 1;

    // Mesh objects to draw
    std::map<std::string, MeshObject> m_meshObjects;

    // Storage buffers (one for each frame in flight).
    std::vector<StorageBuf> m_storageBuffers;
    BufferParameters m_stagingBuffer;    // Staging buffer

    // Compute resources and variables
    SampleParameters m_sampleComputeParams;

    // Sample members
    size_t m_dynamicUBOAlignment;
    std::vector<Vertex> m_particles;
}

We will create as many stora buffers as frams in flight, so that we can update a specific storage buffer and subsequently use it as a vertex buffer bound to the command buffer for rendering the corresponding frame.

Given that these storage buffers will contain an array of vertices\particles and will be used both as shader resources (input/output buffers) and vertex buffers, manual padding is somewhat necessary to adhere to the std140 layout rules.

The staging buffer will be used to store some particles at different positions in a plane, which will move at different speeds. Each of the storage buffer will be used as a destination for a copy operation to transfert data from the staging buffer to device local memory. Refer to the CreateStagingBuffer and CreateStorageBuffers functions in the tutorial’s repository for more details.

The following listing illustrates the descriptor set layout that will be used by both the compute and graphics pipeline objects.

void VKComputeParticles::CreateDescriptorSetLayout()
{
    //
    // Create a Descriptor Set Layout to connect binding points (resource declarations)
    // in the shader code to descriptors within descriptor sets.
    //

    VkDescriptorSetLayoutBinding layoutBinding[4] = {};

    // Binding 0: Uniform buffer (accessed by CS and GS)
    layoutBinding[0].binding = 0;
    layoutBinding[0].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
    layoutBinding[0].descriptorCount = 1;
    layoutBinding[0].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT | VK_SHADER_STAGE_GEOMETRY_BIT;
    layoutBinding[0].pImmutableSamplers = nullptr;

    // Binding 1: Dynamic uniform buffer (accessed by GS and FS)
    layoutBinding[1].binding = 1;
    layoutBinding[1].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
    layoutBinding[1].descriptorCount = 1;
    layoutBinding[1].stageFlags = VK_SHADER_STAGE_GEOMETRY_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;
    layoutBinding[1].pImmutableSamplers = nullptr;

    // Binding 2: Previous Storage Buffer (accessed by CS)
    layoutBinding[2].binding = 2;
    layoutBinding[2].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
    layoutBinding[2].descriptorCount = 1;
    layoutBinding[2].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
    layoutBinding[2].pImmutableSamplers = nullptr;

    // Binding 3: Current Storage Buffer (accessed by CS)
    layoutBinding[3].binding = 3;
    layoutBinding[3].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
    layoutBinding[3].descriptorCount = 1;
    layoutBinding[3].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
    layoutBinding[3].pImmutableSamplers = nullptr;

    VkDescriptorSetLayoutCreateInfo descriptorLayout = {};
    descriptorLayout.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
    descriptorLayout.pNext = nullptr;
    descriptorLayout.bindingCount = 4;
    descriptorLayout.pBindings = layoutBinding;

    VK_CHECK_RESULT(vkCreateDescriptorSetLayout(m_vulkanParams.Device, &descriptorLayout, nullptr, &m_sampleParams.DescriptorSetLayout));
}

As you can see in the code of the AllocateDescriptorSets function below, the resource associated with the binding point 2 is the storage buffer used as the current one by the previous frame. This will allow the compute shader to update the previous particle positions and store them in the current storage buffer, which is associated with the binding point 3.

void VKComputeParticles::AllocateDescriptorSets()
{
    // Allocate MAX_FRAME_LAG descriptor sets from the global descriptor pool.
    // Use the descriptor set layout to calculate the amount on memory required to store the descriptor sets.
    VkDescriptorSetAllocateInfo allocInfo = {};
    allocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
    allocInfo.descriptorPool = m_sampleParams.DescriptorPool;
    allocInfo.descriptorSetCount = static_cast<uint32_t>(MAX_FRAME_LAG);
    std::vector<VkDescriptorSetLayout> DescriptorSetLayouts(MAX_FRAME_LAG, m_sampleParams.DescriptorSetLayout);
    allocInfo.pSetLayouts = DescriptorSetLayouts.data();

    m_sampleParams.FrameRes.DescriptorSets[DESC_SET_GRAPH_COMP].resize(MAX_FRAME_LAG);

    VK_CHECK_RESULT(vkAllocateDescriptorSets(m_vulkanParams.Device, &allocInfo, m_sampleParams.FrameRes.DescriptorSets[DESC_SET_GRAPH_COMP].data()));

    //
    // Write the descriptors updating the corresponding descriptor sets.
    // For every binding point used in a shader code there needs to be at least a descriptor 
    // in a descriptor set matching that binding point.
    //
    VkWriteDescriptorSet writeDescriptorSet[4] = {};

    for (size_t i = 0; i < MAX_FRAME_LAG; i++)
    {
        // Write the descriptor of the uniform buffer.
        // We need to pass the descriptor set where it is store and 
        // the binding point associated with the descriptor in the descriptor set.
        writeDescriptorSet[0].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writeDescriptorSet[0].dstSet = m_sampleParams.FrameRes.DescriptorSets[DESC_SET_GRAPH_COMP][i];
        writeDescriptorSet[0].descriptorCount = 1;
        writeDescriptorSet[0].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
        writeDescriptorSet[0].pBufferInfo = &m_sampleParams.FrameRes.HostVisibleBuffers[i].Descriptor;
        writeDescriptorSet[0].dstBinding = 0;

        // Write the descriptor of the dynamic uniform buffer.
        writeDescriptorSet[1].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writeDescriptorSet[1].dstSet = m_sampleParams.FrameRes.DescriptorSets[DESC_SET_GRAPH_COMP][i];
        writeDescriptorSet[1].descriptorCount = 1;
        writeDescriptorSet[1].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
        writeDescriptorSet[1].pBufferInfo = &m_sampleParams.FrameRes.HostVisibleDynamicBuffers[i].Descriptor;
        writeDescriptorSet[1].dstBinding = 1;

        // Write the descriptor of the previous storage buffer.
        writeDescriptorSet[2].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writeDescriptorSet[2].dstSet = m_sampleParams.FrameRes.DescriptorSets[DESC_SET_GRAPH_COMP][i];
        writeDescriptorSet[2].descriptorCount = 1;
        writeDescriptorSet[2].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
        writeDescriptorSet[2].pBufferInfo = &m_storageBuffers[(i - 1) % MAX_FRAME_LAG].StorageBuffer.Descriptor; // if (i == 0) then -1 % 2 = 1, since -1 = 2*(-1) + 1 <-- remainder
        writeDescriptorSet[2].dstBinding = 2;

        // Write the descriptor of the current storage buffer.
        writeDescriptorSet[3].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writeDescriptorSet[3].dstSet = m_sampleParams.FrameRes.DescriptorSets[DESC_SET_GRAPH_COMP][i];
        writeDescriptorSet[3].descriptorCount = 1;
        writeDescriptorSet[3].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
        writeDescriptorSet[3].pBufferInfo = &m_storageBuffers[i].StorageBuffer.Descriptor;
        writeDescriptorSet[3].dstBinding = 3;

        vkUpdateDescriptorSets(m_vulkanParams.Device, 4, writeDescriptorSet, 0, nullptr);
    }
}

The graphics pipeline will start with points in the vertex shader and subsequently transforms their geometry into quads in the geometry shader. These quads are then rendered with a white transparent color in the fragment shader. Please note that we won’t be delving into the shader code of the graphics pipeline here, as it has already been reviewed in 02.D - Transform Feedback.

void VKComputeParticles::CreatePipelineObjects()
{
    //
    //  Set the various states for the graphics pipeline used by this sample
    //


    // ...

    
    // Input assembly state describes how primitives are assembled by the input assembler.
    // This pipeline will assemble vertex data as a triangle lists.
    VkPipelineInputAssemblyStateCreateInfo inputAssemblyState = {};
    inputAssemblyState.sType = VK_STRUCTURE_TYPE_PIPELINE_INPUT_ASSEMBLY_STATE_CREATE_INFO;
    inputAssemblyState.topology = VK_PRIMITIVE_TOPOLOGY_POINT_LIST;
    
    //
    // Rasterization state
    //
    VkPipelineRasterizationStateCreateInfo rasterizationState = {};
    rasterizationState.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO;
    rasterizationState.polygonMode = VK_POLYGON_MODE_FILL;
    rasterizationState.cullMode = VK_CULL_MODE_BACK_BIT;            // Cull back faces
    rasterizationState.frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE;
    rasterizationState.lineWidth = 1.0f;
    
    //
    // Shaders
    //
    VkShaderModule renderVS = LoadSPIRVShaderModule(m_vulkanParams.Device, GetAssetsPath() + "/data/shaders/render.vert.spv");
    VkShaderModule renderGS = LoadSPIRVShaderModule(m_vulkanParams.Device, GetAssetsPath() + "/data/shaders/render.geom.spv");
    VkShaderModule renderFS = LoadSPIRVShaderModule(m_vulkanParams.Device, GetAssetsPath() + "/data/shaders/render.frag.spv");


    // This sample will use three programmable stage: Vertex, Geometry and Fragment shaders
    std::array<VkPipelineShaderStageCreateInfo, 3> shaderStages{};
    
    // Vertex shader
    shaderStages[0].sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    // Set pipeline stage for this shader
    shaderStages[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
    // Load binary SPIR-V shader module
    shaderStages[0].module = renderVS;
    // Main entry point for the shader
    shaderStages[0].pName = "main";
    assert(shaderStages[0].module != VK_NULL_HANDLE);
    
    // Fragment shader
    shaderStages[1].sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    // Set pipeline stage for this shader
    shaderStages[1].stage = VK_SHADER_STAGE_FRAGMENT_BIT;
    // Load binary SPIR-V shader module
    shaderStages[1].module = renderFS;
    // Main entry point for the shader
    shaderStages[1].pName = "main";
    assert(shaderStages[1].module != VK_NULL_HANDLE);

    // Geometry shader
    shaderStages[2].sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    // Set pipeline stage for this shader
    shaderStages[2].stage = VK_SHADER_STAGE_GEOMETRY_BIT;
    // Load binary SPIR-V shader module
    shaderStages[2].module = renderGS;
    // Main entry point for the shader
    shaderStages[2].pName = "main";
    assert(shaderStages[2].module != VK_NULL_HANDLE);

    //
    // Enable alpha blending
    //
    blendAttachmentState[0].blendEnable = VK_TRUE;
    blendAttachmentState[0].srcColorBlendFactor = VK_BLEND_FACTOR_SRC_ALPHA;
    blendAttachmentState[0].dstColorBlendFactor = VK_BLEND_FACTOR_ONE_MINUS_SRC_ALPHA;
    blendAttachmentState[0].colorBlendOp = VK_BLEND_OP_ADD;

    
    // ...

    
    // Create a graphics pipeline for drawing using a solid color
    VK_CHECK_RESULT(vkCreateGraphicsPipelines(m_vulkanParams.Device, 
                                              VK_NULL_HANDLE, 1, 
                                              &pipelineCreateInfo, nullptr, 
                                              &m_sampleParams.Pipelines[PIPELINE_RENDER]));

    
    // ...

}

The code of the PrepareCompute function is similar to what we examined in the previous tutorial. It involves the creation of several pipeline resources, including a compute pipeline to execute a compute shader. This shader takes on the role of the vertex shader in the capturing step in the transform feedback tutorial, updating particle positions based on their speed. Please refer to the complete source code available in the tutorial’s repository for the implementation details of the PrepareCompute function.

During the execution of the compute work, we set a memory barrier for the current storage buffer, bind the compute pipeline and the descriptor set, and dispatch the compute work, which consists of a single work group.

void VKComputeParticles::PopulateComputeCommandBuffer()
{
    VkCommandBufferBeginInfo cmdBufInfo = {};
    cmdBufInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    cmdBufInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

    VK_CHECK_RESULT(vkBeginCommandBuffer(m_sampleComputeParams.FrameRes.CommandBuffers[m_frameIndex], &cmdBufInfo));

    // Set a memory barrier for the current storage buffer between VS and CS
    SetBufferMemoryBarrier(m_sampleComputeParams.FrameRes.CommandBuffers[m_frameIndex],
                           m_storageBuffers[m_frameIndex].StorageBuffer.Handle, VK_WHOLE_SIZE, 0,
                           VK_ACCESS_SHADER_READ_BIT, VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,
                           VK_ACCESS_SHADER_WRITE_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

    // Bind the compute pipeline to a compute bind point of the command buffer
    vkCmdBindPipeline(m_sampleComputeParams.FrameRes.CommandBuffers[m_frameIndex], 
                        VK_PIPELINE_BIND_POINT_COMPUTE, 
                        m_sampleComputeParams.Pipelines[PIPELINE_COMPUTE]);
    
    // Dynamic offset used to offset into the uniform buffer described by the dynamic uniform buffer and containing mesh information
    uint32_t dynamicOffset = m_meshObjects[MESH_PARTICLES].dynIndex * static_cast<uint32_t>(m_dynamicUBOAlignment);

    // Bind descriptor set
    vkCmdBindDescriptorSets(m_sampleComputeParams.FrameRes.CommandBuffers[m_frameIndex], 
                            VK_PIPELINE_BIND_POINT_COMPUTE, 
                            m_sampleParams.PipelineLayout, 
                            0, 1, 
                            &m_sampleParams.FrameRes.DescriptorSets[DESC_SET_GRAPH_COMP][m_frameIndex], 
                            1, &dynamicOffset);

    // Dispatch compute work
    vkCmdDispatch(m_sampleComputeParams.FrameRes.CommandBuffers[m_frameIndex], 1, 1, 1);

    VK_CHECK_RESULT(vkEndCommandBuffer(m_sampleComputeParams.FrameRes.CommandBuffers[m_frameIndex]));
}

2.2 - GLSL code

The compute shader defines a one-dimensional work group with 81 invocations, one for each particle.

#version 450

struct Particle {
    vec3 position;
    // pad_1
    vec2 size;
    float speed;
    // pad_2
};

layout(std140, set = 0, binding = 0) uniform bufUniform {
    mat4 viewMatrix;
    mat4 projMatrix;
    vec3 cameraPos;
    float deltaTime;
} uBuf;

layout(std140, set = 0, binding = 2) readonly buffer bufStorageIn {
   Particle particlesIn[ ];
};

layout(std140, set = 0, binding = 3) buffer bufStorageOut {
   Particle particlesOut[ ];
};

layout (local_size_x = 81, local_size_y = 1, local_size_z = 1) in;

void main() 
{
    uint index = gl_GlobalInvocationID.x;  

    particlesOut[index] = particlesIn[index]; // output particle is the same as the input particle but ...

    // ... decrease its height over time based on its speed and previous location
    particlesOut[index].position.z -= (particlesOut[index].speed * uBuf.deltaTime);
    
    // Reset the height of the particle at some point
    if (particlesOut[index].position.z < -50.0f)
    {
        particlesOut[index].position.z = 50.0f;
    }
}

Observe that unsized arrays can be declared in storage buffers. This is possible because storage buffers are generally larger than uniform buffers, allowing the declaration of unsized arrays. The actual size of these arrays is determined at runtime when the shader code is executed. This determination is based on the remaining space within the range of memory specified in the descriptor of the storage buffer, once the descriptor set containing it is bound to the command buffer. Consequently, only an unbounded array is permitted in a storage buffer, and it must be the last member in the corresponding block.
On the other hand, uniform buffers are intended to be read-only and have a much smaller size limit compared to storage buffers. For this reason, arrays declared in uniform buffers must be sized at compile-time.

The code for the shader programs in the graphics pipeline is similar to the one in the rendering step we examined in 02.D - Transform Feedback.

Source code: LearnVulkan

References

[1] Vulkan API Specifications

If you found the content of this tutorial somewhat useful or interesting, please consider supporting this project by clicking on the Sponsor button. Whether a small tip, a one time donation, or a recurring payment, it’s all welcome! Thank you!