Optimizing HZB Mip-Chain Generation and Bindless Argument Tables in a Custom Metal Engine

Hi everyone,

I’ve been building a custom 3D rendering engine called Crescent from scratch in C++20 with metal-cpp, targeting macOS and visionOS. My primary goal is a GPU-driven pipeline with as few CPU-side bottlenecks as possible that takes full advantage of Apple Silicon’s unified memory and TBDR architecture.

While the fundamental systems are stable, I am looking for architectural feedback from Metal framework engineers regarding specific synchronization and latency challenges.

Current Core Implementations:

GPU-Driven Instance Culling: High-performance occlusion culling using a Hierarchical Z-Buffer (HZB) approach via Compute Shaders.

Clustered Forward Shading: Support for high-count dynamic lights through view-space clustering.

Temporal Stability: Custom TAA with history rejection and Motion Blur resolve.

Asset Infrastructure: Robust GUID-based scene serialization and a JSON-driven ECS hierarchy.

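The Architectural Challenge: I am seeing some synchronization overhead when generating the HZB mip chain, and I am weighing the cost of encoder transitions against barriers inside a single encoder on Apple Silicon. The current implementation creates one compute encoder per mip level (the excerpt starts partway through the canBuildHzb condition):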

        && m_hzbInitPipeline && m_hzbDownsamplePipeline && !m_hzbMipViews.empty();
    if (canBuildHzb) {
        MTL::ComputeCommandEncoder* hzbInit = commandBuffer->computeCommandEncoder();
        hzbInit->setComputePipelineState(m_hzbInitPipeline);
        hzbInit->setTexture(m_depthTexture, 0);
        hzbInit->setTexture(m_hzbMipViews[0], 1);
        if (m_pointClampSampler) {
            hzbInit->setSamplerState(m_pointClampSampler, 0);
        } else if (m_linearClampSampler) {
            hzbInit->setSamplerState(m_linearClampSampler, 0);
        }
        const uint32_t hzbWidth = static_cast<uint32_t>(m_hzbMipViews[0]->width());
        const uint32_t hzbHeight = static_cast<uint32_t>(m_hzbMipViews[0]->height());
        const uint32_t threads = 8;
        MTL::Size tgSize = MTL::Size(threads, threads, 1);
        // dispatchThreads takes the grid size in threads and handles partial
        // threadgroups at the edges, so the exact texel count is sufficient.
        MTL::Size gridSize = MTL::Size(hzbWidth, hzbHeight, 1);
        hzbInit->dispatchThreads(gridSize, tgSize);
        hzbInit->endEncoding();

        for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) {
            MTL::Texture* src = m_hzbMipViews[mip - 1];
            MTL::Texture* dst = m_hzbMipViews[mip];
            if (!src || !dst) {
                continue;
            }
            MTL::ComputeCommandEncoder* downEncoder = commandBuffer->computeCommandEncoder();
            downEncoder->setComputePipelineState(m_hzbDownsamplePipeline);
            downEncoder->setTexture(src, 0);
            downEncoder->setTexture(dst, 1);
            const uint32_t mipWidth = static_cast<uint32_t>(dst->width());
            const uint32_t mipHeight = static_cast<uint32_t>(dst->height());
            // One thread per destination texel; no rounding up is needed
            // with dispatchThreads.
            MTL::Size downGrid = MTL::Size(mipWidth, mipHeight, 1);
            downEncoder->dispatchThreads(downGrid, tgSize);
            downEncoder->endEncoding();
        }

        if (m_instanceCullHzbPipeline) {
            dispatchInstanceCulling(m_instanceCullHzbPipeline, true);
        }
    }
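
For reference, the per-mip dispatch sizes used in the loop above can be worked out ahead of time. The following is a minimal sketch, not code from Crescent: MipDispatch, ceilDiv, and buildHzbDispatches are hypothetical names, and it assumes each level is half the previous one, rounded down and clamped to 1. It records both the exact texel counts (the unit dispatchThreads works in) and the rounded-up threadgroup counts (the unit dispatchThreadgroups works in).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type; not part of the engine code above.
struct MipDispatch {
    uint32_t width;    // texels in this mip level
    uint32_t height;
    uint32_t groupsX;  // threadgroups needed to cover the level
    uint32_t groupsY;
};

// Integer ceiling division.
constexpr uint32_t ceilDiv(uint32_t value, uint32_t divisor) {
    return (value + divisor - 1) / divisor;
}

// Builds the dispatch sizes for a full chain from the base level down to 1x1.
// Assumption: each level halves the previous one, rounded down, clamped to 1.
std::vector<MipDispatch> buildHzbDispatches(uint32_t baseWidth, uint32_t baseHeight,
                                            uint32_t threadsPerSide = 8) {
    assert(threadsPerSide > 0);
    std::vector<MipDispatch> levels;
    uint32_t w = std::max(baseWidth, 1u);
    uint32_t h = std::max(baseHeight, 1u);
    for (;;) {
        levels.push_back({w, h, ceilDiv(w, threadsPerSide), ceilDiv(h, threadsPerSide)});
        if (w == 1 && h == 1) {
            break;
        }
        w = std::max(w / 2, 1u);
        h = std::max(h / 2, 1u);
    }
    return levels;
}
```

For a 1024x512 base level this yields 11 entries, and every level from 8x4 downward fits in a single 8x8 threadgroup, which is where per-encoder overhead is likely to dominate the actual work.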

My Questions:

Encoder Synchronization: Would you recommend moving this loop into a single ComputeCommandEncoder and separating the dispatches with memoryBarrier(MTL::BarrierScopeTextures) to keep the data resident in the L2 cache, or is the overhead of separate encoders negligible for depth downsampling on Apple's TBDR GPUs?
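
To make the single-encoder variant concrete, here is a minimal sketch of the command ordering it implies. It is not Metal code: HzbOp, HzbCommand, and planSingleEncoderHzb are hypothetical names that only record the sequence so it can be checked without a GPU. In metal-cpp, each TextureBarrier entry would correspond to a memoryBarrier(MTL::BarrierScopeTextures) call on the compute encoder. My understanding is that explicit barriers matter mainly for an encoder created with MTL::DispatchTypeConcurrent, since the default serial dispatch type already orders dispatches, but I would welcome correction on that.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical recording types standing in for encoder calls. Not Metal API.
enum class HzbOp { Dispatch, TextureBarrier };

struct HzbCommand {
    HzbOp    op;
    uint32_t mip;  // Dispatch: mip being written. TextureBarrier: mip just written.
};

// Plans the work for one encoder: write mip 0 from the depth buffer, then for
// each further mip insert a barrier (the source level must be fully written
// before it is read) followed by the downsample dispatch.
std::vector<HzbCommand> planSingleEncoderHzb(uint32_t mipCount) {
    assert(mipCount > 0);
    std::vector<HzbCommand> commands;
    commands.push_back({HzbOp::Dispatch, 0});
    for (uint32_t mip = 1; mip < mipCount; ++mip) {
        commands.push_back({HzbOp::TextureBarrier, mip - 1});
        commands.push_back({HzbOp::Dispatch, mip});
    }
    return commands;
}
```

A chain of N mips therefore needs N dispatches and N - 1 barriers in one encoder, compared with N encoders in the current loop.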

visionOS Bindless Latency: For stereo rendering on visionOS, what are the best practices for managing MTL4ArgumentTable updates at 90Hz+? I want to ensure that updating bindless resources for each eye doesn't introduce unnecessary CPU-to-GPU latency.
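
One arrangement under consideration is a small pool of argument tables indexed by frame-in-flight and eye, so the CPU never rewrites a table the GPU may still be reading. The sketch below shows only the slot arithmetic. kFramesInFlight, kEyeCount, and argumentTableSlot are hypothetical names, the value 3 is an assumption, and the sketch says nothing about how MTL4ArgumentTable itself behaves.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants; real values depend on the engine's frame pacing.
constexpr uint32_t kFramesInFlight = 3;  // assumed triple buffering
constexpr uint32_t kEyeCount = 2;        // left and right eye

// Maps (frame, eye) to a slot in a pool of kFramesInFlight * kEyeCount tables.
// Slots for the same frame are adjacent; a slot is reused every
// kFramesInFlight frames.
inline uint32_t argumentTableSlot(uint64_t frameIndex, uint32_t eye) {
    assert(eye < kEyeCount);
    return static_cast<uint32_t>(frameIndex % kFramesInFlight) * kEyeCount + eye;
}
```

With this layout the CPU still has to wait (for example on a semaphore signaled from the command buffer's completion handler) until frame N - kFramesInFlight has finished before writing slot contents for frame N.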

Memory Management: Can memoryless storage (MTL::StorageModeMemoryless) or any related hint be applied to the intermediate HZB levels to save bandwidth during this pass?

I’ve attached a screenshot of a scene rendered with the engine (PBR, SSR, and IBL).
