Thanks a lot for all those clarifications! I see at least two use cases in which understanding the camera extrinsics is crucial:
An object is tracked with a custom, use-case-specific algorithm that achieves much higher tracking accuracy for this scenario than any out-of-the-box tracking solution. The algorithm computes a pose relative to the camera; where is the object in world space?
A gridded sheet is tracked via ARKit image tracking, using a high-feature texture at its center. The user can color each cell of the grid with one of a set of distinct colors, which the system should interpret. Given a 3D coordinate in world space, which pixel area does it correspond to in the camera frame?
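For the second use case, this is roughly the projection I have in mind, as a minimal sketch rather than verified code: it assumes a pinhole model, an ARKit-style camera convention (+X right, +Y up, camera looking down its local -Z axis), and an intrinsics layout like `ARCamera.intrinsics` (fx, fy on the diagonal, principal point in the last column). The `worldFromCamera` parameter is the camera pose in world space, which is exactly what the device-anchor question below is about.

```swift
import simd

// Sketch: project a world-space point into pixel coordinates of a camera frame.
// Assumptions (not verified against the visionOS camera APIs):
// - worldFromCamera: camera pose in world space
// - camera looks down its local -Z axis, +Y up
// - intrinsics column-major with fx = [0][0], fy = [1][1], cx = [2][0], cy = [2][1]
func projectToPixel(worldPoint: SIMD3<Float>,
                    worldFromCamera: simd_float4x4,
                    intrinsics: simd_float3x3) -> SIMD2<Float>? {
    // Bring the point into camera space.
    let cameraFromWorld = worldFromCamera.inverse
    let p = cameraFromWorld * SIMD4<Float>(worldPoint, 1)

    // Depth along the viewing direction (camera looks down -Z in this convention).
    let depth = -p.z
    guard depth > 0 else { return nil } // point is behind the camera

    // Normalized image coordinates (y flipped because pixel rows grow downward).
    let xn = p.x / depth
    let yn = -p.y / depth

    // Apply intrinsics to get pixel coordinates.
    let fx = intrinsics[0][0], fy = intrinsics[1][1]
    let cx = intrinsics[2][0], cy = intrinsics[2][1]
    return SIMD2<Float>(fx * xn + cx, fy * yn + cy)
}
```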
We are now using the WorldTrackingProvider's queryDeviceAnchor with the current timestamp CACurrentMediaTime(). Is multiplying that device transform by the camera extrinsics the correct approach to obtain the camera pose in world space?
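For reference, this is how we are currently composing the transforms, as a minimal sketch: `worldTracking` is our running WorldTrackingProvider, and `cameraExtrinsics` is the 4x4 matrix from the camera frame's parameters, which we interpret as device-from-camera (if it is actually camera-from-device, it would need to be inverted first).

```swift
import ARKit
import QuartzCore
import simd

// Sketch of the composition in question:
// world-from-device (device anchor) * device-from-camera (extrinsics) = world-from-camera.
func worldFromCamera(worldTracking: WorldTrackingProvider,
                     cameraExtrinsics: simd_float4x4) -> simd_float4x4? {
    guard let deviceAnchor = worldTracking.queryDeviceAnchor(atTimestamp: CACurrentMediaTime()) else {
        return nil
    }
    return deviceAnchor.originFromAnchorTransform * cameraExtrinsics
}

// For use case 1: a pose computed relative to the camera can then be placed in world space.
func worldFromObject(worldFromCamera: simd_float4x4,
                     cameraFromObject: simd_float4x4) -> simd_float4x4 {
    worldFromCamera * cameraFromObject
}
```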
For debugging purposes we are now drawing the captured frames onto a canvas positioned one meter in front of the camera location (as described above), with a pixel density of 1/focal length, together with a small sphere at the center of the canvas and a tube going from the camera to the center of the canvas. It looks like the "left camera" is actually the right camera (from the user's perspective); is that correct?
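Concretely, the canvas sizing we use looks roughly like this (a sketch assuming a pinhole model: at a distance of 1 m, one pixel covers 1/fx meters horizontally and 1/fy meters vertically):

```swift
import simd

// Debug-canvas size at 1 m in front of the camera, assuming a pinhole model.
// Note: if the principal point (cx, cy) is not exactly the image center, the canvas
// may need to be shifted off the optical axis; under the y-up convention above, the
// image-center pixel sits at camera-space offset ((width/2 - cx)/fx, (cy - height/2)/fy) at 1 m.
func canvasSizeAtOneMeter(imageWidth: Float, imageHeight: Float,
                          fx: Float, fy: Float) -> SIMD2<Float> {
    SIMD2<Float>(imageWidth / fx, imageHeight / fy)
}
```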
When rendered for the right eye, the tube appears to point perfectly forward, originating slightly from the top left of the display. Does this mean that the whole scene is rendered from the camera's position? If not, what does it mean?
Unfortunately we have still not been able to display the tracked object at the correct pose in world space; there is a consistent offset, very similar to the offset between the passthrough and the frame rendered on the canvas. Thanks a lot for your assistance so far, it is very much appreciated! I will try to test all the assumptions in a minimal project, which we could share if that helps; I will keep you posted here if I make any progress.