Post · Replies · Boosts · Views · Activity
DocC Fails to Locate <string>
Xcode 14's release notes say that DocC now supports Objective-C and C APIs. To clarify, does this mean that C++ and Objective-C++ APIs are not yet supported? I was attempting to convert from Doxygen, but the documentation build fails with the error "Command ExtractAPI failed with a nonzero exit code", citing the first line of the header: #include <string>.
Replies: 4 · Boosts: 2 · Views: 2.1k · Feb ’23
ICB Support for Object/Mesh Shaders
Hello! I am starting to dig into the docs on object and mesh shaders. I see that the Metal API on the CPU side has new functions for setting object and mesh buffers in the new programmable stages. But I don't see corresponding changes to the API for MTLIndirectCommandBuffer. Will we be able to use the GPU to encode draw commands using a pipeline that leverages the new shader types? Thanks!
Replies: 0 · Boosts: 0 · Views: 1.2k · Jun ’22
TBDR Persistent Threadgroup Memory/Mid-Render Kernel
Hello - I am in the early phase of developing an algorithm and was hoping someone could help me understand how threadgroup memory persists before I go too far down the wrong path. For simplicity, let's say I am working with 32 KB of threadgroup memory, and I have two kernels, K1 and K2. In the first pass, each threadgroup in K1 loads 8192 32-bit values into threadgroup memory (using all 32 KB). In the next pass, K2 accesses the threadgroup memory from K1 and performs some operation on the data. Since threadgroup memory usually persists only for the lifetime of the threadgroup, in this mid-render kernel example, what can K2 access from the threadgroup memory written by K1? For example, say we have:

kernel void K1(threadgroup uint *mem_k1 [[ threadgroup(0) ]]);
kernel void K2(threadgroup uint *mem_k2 [[ threadgroup(0) ]]);

Say we launch both kernels with 10 threadgroups. Can K2 access every block of threadgroup memory initialized in K1, or does [[ threadgroup(0) ]] refer only to one 32 KB block? If we launch K1 and K2 with different numbers of threadgroups per grid, does that change anything? Or is [[ threadgroup(0) ]] entirely dependent on what the host code allocates via the Metal API? Thank you in advance.
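To make the question concrete, here is a sketch of the two passes as I currently picture them (the fill pattern and launch shapes are made up; whether mem_k2 can ever alias K1's block in a mid-render pass is exactly what I am unsure about):

```metal
#include <metal_stdlib>
using namespace metal;

// Pass 1: each threadgroup fills its own 32 KB block (8192 uints).
kernel void K1(threadgroup uint *mem_k1 [[ threadgroup(0) ]],
               uint tid    [[ thread_index_in_threadgroup ]],
               uint tcount [[ threads_per_threadgroup ]])
{
    for (uint i = tid; i < 8192; i += tcount)
        mem_k1[i] = i;
    threadgroup_barrier(mem_flags::mem_threadgroup);
}

// Pass 2: mem_k2 is a per-threadgroup allocation made by the host encoder.
// Is it guaranteed to see the contents K1 wrote for the same tile, or are
// its contents undefined unless re-initialized here?
kernel void K2(threadgroup uint *mem_k2 [[ threadgroup(0) ]],
               uint tid [[ thread_index_in_threadgroup ]])
{
    uint value = mem_k2[tid]; // defined, or garbage?
    (void)value;
}
```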
Replies: 1 · Boosts: 0 · Views: 1k · Dec ’21
Using Function Address Space Qualifiers
Hello - I was hoping someone could help me understand the syntax of the address space qualifiers that I have seen in the Metal standard library. I have seen the qualifiers used on constructors and on member functions, such as the following:

struct A {
    A() thread { }
    A() device { }
    void do_it(device int &param) thread; // 1
    void do_it(thread int &param) thread; // 2
    void do_it(device int &param) device; // 3
    void do_it(thread int &param) device; // 4
};

What do the address space qualifiers mean in this context, exactly? Are these overloads depending on where A resides in memory? For example:

kernel void process_data(device A &d_var, device int &d_out)
{
    thread A p_var { };      // OK
    threadgroup A t_var { }; // ERROR?
    thread int p_out = 0;
    threadgroup int t_out = 0;

    p_var.do_it(d_out); // calls 1?
    p_var.do_it(p_out); // calls 2?
    p_var.do_it(t_out); // error?
    d_var.do_it(d_out); // calls 3?
    d_var.do_it(p_out); // calls 4?
    d_var.do_it(t_out); // error?
}

Or does this mean something else entirely? Would it be valid to add an address space qualifier to a global function? For example:

void do_it() device; // declared at program scope

If so, what does it mean for a function to be in (e.g.) device memory? Thank you for helping me understand this better.
Replies: 1 · Boosts: 0 · Views: 964 · Nov ’21
Metal SIMD Group Functions
Is there any documentation, aside from the Metal Shading Language Specification, that explains the SIMD-group functions such as simd_or, simd_and, simd_xor, etc.? CUDA has functions comparable to the vote/ballot SIMD functions, but I am not aware of anything similar to the above. The MSL specification is vague in stating, for example, that simd_or returns the bitwise OR across all active threads. Does that mean the T data argument is combined bitwise with the value in each SIMD lane? What does that return, exactly? Do you get the highest lane's value, the lowest lane's value, or the combined result of the bitwise operation?
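To illustrate my current reading of the spec (an interpretation, not something I have found documented): simd_or appears to be a reduction that ORs each active lane's argument together and gives the single combined result back to every lane.

```metal
#include <metal_stdlib>
using namespace metal;

kernel void simd_or_demo(device uint *out [[ buffer(0) ]],
                         uint lane [[ thread_index_in_simdgroup ]],
                         uint tid  [[ thread_position_in_grid ]])
{
    // Each lane contributes a single distinct bit:
    // lane 0 -> 0x1, lane 1 -> 0x2, lane 2 -> 0x4, ...
    uint contribution = 1u << (lane % 32);

    // My reading: every lane receives the same value - the bitwise OR of
    // all active lanes' contributions (0xFFFFFFFF if all 32 lanes of the
    // SIMD group are active). Is that correct?
    uint combined = simd_or(contribution);

    out[tid] = combined;
}
```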
Replies: 0 · Boosts: 0 · Views: 1.2k · Aug ’21