We have been seeing a mysterious crash in our media server app that I've never encountered before. After fixing a number of other rare thread-safety crashes relating to Metal buffers, we now get this rare crash inside Metal's com.Metal.CompletionQueueDispatch queue.
I have no clue what is happening here. From the frames, it looks like an uncaught exception is reaching std::terminate() on Metal's completion queue, which then calls abort().
All of the other threads in the crash log appear to be in a normal state.
Thread 70 Crashed:: updateAllMedia Dispatch queue: com.Metal.CompletionQueueDispatch
0 libsystem_kernel.dylib 0x1af572d38 __pthread_kill + 8
1 libsystem_pthread.dylib 0x1af5a7ee0 pthread_kill + 288
2 libsystem_c.dylib 0x1af4e2330 abort + 168
3 libc++abi.dylib 0x1af562b18 abort_message + 132
4 libc++abi.dylib 0x1af552a3c demangling_terminate_handler() + 312
5 libobjc.A.dylib 0x1af4481c8 _objc_terminate() + 160
6 libc++abi.dylib 0x1af561eb4 std::__terminate(void (*)()) + 20
7 libc++abi.dylib 0x1af561e50 std::terminate() + 64
8 libdispatch.dylib 0x1af3e4288 _dispatch_client_callout4 + 40
9 libdispatch.dylib 0x1af40053c _dispatch_mach_msg_invoke + 464
10 libdispatch.dylib 0x1af3eb784 _dispatch_lane_serial_drain + 376
11 libdispatch.dylib 0x1af40125c _dispatch_mach_invoke + 456
12 libdispatch.dylib 0x1af3eb784 _dispatch_lane_serial_drain + 376
13 libdispatch.dylib 0x1af3ec438 _dispatch_lane_invoke + 444
14 libdispatch.dylib 0x1af3eb784 _dispatch_lane_serial_drain + 376
15 libdispatch.dylib 0x1af3ec404 _dispatch_lane_invoke + 392
16 libdispatch.dylib 0x1af3f6c98 _dispatch_workloop_worker_thread + 648
17 libsystem_pthread.dylib 0x1af5a4360 _pthread_wqthread + 288
18 libsystem_pthread.dylib 0x1af5a3080 start_wqthread + 8
Note that the thread name "updateAllMedia" is a misnomer, because this thread appears to be a general Metal dispatch queue. I wish there were a debugging option in Metal, something like "setThreadName", to name its internal threads.
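Since the backtrace runs through _objc_terminate, one thing I'm considering, purely as a diagnostic sketch and not yet verified against this crash, is installing an Objective-C uncaught-exception hook so whatever is being thrown gets logged before abort(). (LogUncaughtException and InstallTerminateLogging are just placeholder names.)

#import <Foundation/Foundation.h>
#import <objc/objc-exception.h>

// Log whatever Objective-C exception reaches _objc_terminate before the
// process aborts; purely a diagnostic aid, installed once at startup.
static void LogUncaughtException(id exception) {
    NSLog(@"Uncaught exception on %@: %@\n%@",
          [NSThread currentThread], exception,
          [exception respondsToSelector:@selector(callStackSymbols)]
              ? [exception callStackSymbols] : @"(no backtrace)");
}

static void InstallTerminateLogging(void) {
    objc_setUncaughtExceptionHandler(&LogUncaughtException);
}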
We have a production Metal app with a complex multithreaded Metal pipeline.
When everything is operating smoothly, it works great.
Even when extremely overloaded, it can go for days at a time without crashing. But that still isn't good enough for our users.
Unfortunately, since I have zero visibility into an id<MTLBuffer>, I have no way of knowing when Metal is "done" with it.
When overloaded, stale Metal render passes need to be 'aborted', which results in Metal callbacks not being called.
For example, these callbacks may not be called after an aborted pass:
id<MTLCommandBuffer> m_cmdbuf;

// cpr holds our per-pass timestamps; neither handler is guaranteed to run
// if the pass is aborted.
[m_cmdbuf addScheduledHandler:^(id<MTLCommandBuffer> cb) {
    cpr->scheduled = MachAbsoluteTime();
}];
[m_cmdbuf addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    cpr->completed = MachAbsoluteTime();
}];
For the moment, our workaround is a system which waits a few seconds after we "think" a rendering pass should be done with all its (aborted) resources before releasing buffers. This is not ideal, to say the least.
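Concretely, the deferred release looks roughly like this (a simplified sketch of the idea, not our exact code; the 3-second delay and the method name are placeholders):

#import <Metal/Metal.h>

// Hold the buffers from an aborted pass for a few extra seconds before
// letting ARC release them.  The delay is a guess, which is exactly why
// this isn't a real solution.
- (void)scheduleDeferredReleaseOfBuffers:(NSArray<id<MTLBuffer>> *)buffers {
    dispatch_time_t when = dispatch_time(DISPATCH_TIME_NOW, 3 * NSEC_PER_SEC);
    dispatch_after(when, dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
        // The block keeps the array (and therefore the MTLBuffers) alive
        // until it runs; dropping the last reference here frees them.
        (void)buffers;
    });
}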
So, in summary, my question: it would be nice to be able to 'query' an id<MTLBuffer> to know when Metal is done with it, so that we know it's safe to release it along with our own internal resources.
Is there any such (undocumented) mechanism? I have exhaustively read all existing Metal documentation many times.
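To be concrete about what I mean by 'query': the closest thing I can find today is the status property on MTLCommandBuffer, so the best we can do is remember which command buffer last used a given buffer and poll that. A sketch of that idea (the bookkeeping is hypothetical, and it still only tells us about the command buffer, not the id<MTLBuffer> itself):

#import <Metal/Metal.h>

// lastUse is whatever command buffer most recently encoded work that
// references the MTLBuffer in question (tracking that is up to us).
static BOOL BufferProbablyRetired(id<MTLCommandBuffer> lastUse) {
    MTLCommandBufferStatus s = lastUse.status;
    // Completed or Error should mean the GPU is no longer touching the
    // resources referenced by that command buffer; anything earlier
    // means "not yet".
    return (s == MTLCommandBufferStatusCompleted ||
            s == MTLCommandBufferStatusError);
}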
An idea that I've been toying with: it would be nice to have something akin to Zombie detection running all the time, for id<MTLBuffer> objects only.
In OpenGL, it was OK to use a released texture: you might display a bad frame, but you would not crash. Is there any similar option for id<MTLTexture>?
Topic: Graphics & Games
SubTopic: Metal
Multicast sockets fail to send/receive on macOS with errno 65, "No route to host".
Wireshark and Terminal.app (which have root privileges) both show incoming multicast traffic just fine.
Normal UDP broadcast sockets have no problems.
Toggling the Security & Privacy -> Local Network setting may fix the problem for some users.
There is no pattern to when the multicast sockets fail. Things we have tried:
- Recreating the sockets sometimes fixes the problem.
- Restarting the app: sometimes multicast fails, sometimes it succeeds (intermittent, no pattern).
- Rebooting the machine: intermittent failures.
- Creating a fresh user on the machine, installing a single version of the app, and granting it permission: intermittent failures, same as above.
We have all the normal entitlements, and the app is notarized.
Similar posts:
See FB16923535; related to FB16512666.
https://forum.xojo.com/t/udp-multicast-receive-on-mac-failing-intermittant/83221
See my post from 2012, "distinguishing between SENDING sockets and RECEIVING sockets", for a source-code example of how we bind multicast sockets: https://stackoverflow.com/questions/10692956/what-does-it-mean-to-bind-a-multicast-udp-socket . Our other socket code is standard "Stevens, et al." code; bind() is the call that fails in this case. Note that the 2012 post is still relevant, and that it is a workaround for a longstanding Apple bug that was never fixed. Namely: "Without this fix, multicast sending will intermittently get sendto() errno 'No route to host'. If anyone can shed light on why unplugging a DHCP gateway causes Mac OS X multicast SENDING sockets to get confused, I would love to hear it."
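For context, here is roughly what a Stevens-style multicast receive bind looks like (illustrative only, error handling stripped; the real code is in the linked post, and the group/port values are examples):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

// Receiving socket bound to the multicast group address itself.
// bind() is the call that intermittently fails with errno 65
// (EHOSTUNREACH, "No route to host") on Sequoia.
static int OpenMulticastReceiver(const char *group, uint16_t port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr(group);     // bind to the group address
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;                               // errno 65 shows up here

    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(group);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
    return fd;
}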
That 2012 sending bug may be a hint as to the underlying bug Apple really needs to fix, but if it's not, then please, Apple, fix the Sequoia bug first. These are probably different bugs: in the 2012 case, sendto() fails when a socket becomes "unbound" after you unplug an unrelated network cable; in the current case, bind() fails, so sendto() is never even called.
Note that we have also tried other implementations for network discovery, including Bonjour, CFNetwork, etc. Bonjour fails intermittently and suffers from both bugs mentioned above, amongst others.