Real Time Text detection using iOS18 RecognizeTextRequest from video buffer returns gibberish

Hey devs, I'm trying to build my own real-time text detection, like this Apple project: https://developer.apple.com/documentation/vision/extracting-phone-numbers-from-text-in-images I want to use the new iOS 18 RecognizeTextRequest instead of the old VNRecognizeTextRequest in my SwiftUI project. Below is my delegate code with the camera setup. I'm trying to scan English words in books, and the idea is to eventually pick out a single word inside a region of interest (ROI). For now I can't even get proper words, so I removed the ROI while debugging in case my math is wrong.

@Observable
class CameraManager: NSObject, AVCapturePhotoCaptureDelegate
...
    override init() {
        super.init()
        setUpVisionRequest()
    }
    private func setUpVisionRequest() {
        textRequest = RecognizeTextRequest(.revision3)
    }
...
func setup() -> Bool {
        captureSession.beginConfiguration()

        guard
            let captureDevice = AVCaptureDevice.default(
                .builtInWideAngleCamera, for: .video, position: .back)
        else {
            return false
        }
        self.captureDevice = captureDevice
        guard let deviceInput = try? AVCaptureDeviceInput(device: captureDevice)
        else {
            return false
        }

        /// Check whether the session can add input.
        guard captureSession.canAddInput(deviceInput) else {
            print("Unable to add device input to the capture session.")
            return false
        }

        /// Add the input and output to session
        captureSession.addInput(deviceInput)

        /// Configure the video data output
        videoDataOutput.setSampleBufferDelegate(
            self, queue: videoDataOutputQueue)

        if captureSession.canAddOutput(videoDataOutput) {
            captureSession.addOutput(videoDataOutput)
            videoDataOutput.connection(with: .video)?
                .preferredVideoStabilizationMode = .off
        } else {
            return false
        }

        // Set zoom and autofocus to help focus on very small text
        do {
            try captureDevice.lockForConfiguration()
            captureDevice.videoZoomFactor = 2
            captureDevice.autoFocusRangeRestriction = .near
            captureDevice.unlockForConfiguration()
        } catch {
            print("Could not set zoom level due to error: \(error)")
            return false
        }
        captureSession.commitConfiguration()

        // potential issue with background vs dispatchqueue ??
        Task(priority: .background) {
            captureSession.startRunning()
        }
        return true
    }

}
// Issue here ???
extension CameraManager: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(
        _ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        Task {
            
            textRequest.recognitionLevel = .fast
            textRequest.recognitionLanguages = [Locale.Language(identifier: "en-US")]
            
            do {
                let observations = try await textRequest.perform(on: pixelBuffer)
                
                for observation in observations {
                    
                    let recognizedText = observation.topCandidates(1).first
                    print("recognized text \(recognizedText)")
                }
            } catch {
                print("Recognition error: \(error.localizedDescription)")
            }
        }
    }
}

The results I get look like this (a full page of English text from any book):

recognized text Optional(RecognizedText(string: e bnUI W4, confidence: 0.5))
recognized text Optional(RecognizedText(string: ?'U, confidence: 0.3))
recognized text Optional(RecognizedText(string: traQt4, confidence: 0.3))
recognized text Optional(RecognizedText(string: li, confidence: 0.3))
recognized text Optional(RecognizedText(string: 15,1,#, confidence: 0.3))
recognized text Optional(RecognizedText(string: jllÈ, confidence: 0.3))
recognized text Optional(RecognizedText(string: vtrll, confidence: 0.3))
recognized text Optional(RecognizedText(string: 5,1,: 11, confidence: 0.5))
recognized text Optional(RecognizedText(string: 1141, confidence: 0.3))
recognized text Optional(RecognizedText(string: jllll ljiiilij41, confidence: 0.3))
recognized text Optional(RecognizedText(string: 2f4, confidence: 0.3))
recognized text Optional(RecognizedText(string: ktril, confidence: 0.3))
recognized text Optional(RecognizedText(string: ¥LLI, confidence: 0.3))
recognized text Optional(RecognizedText(string: 11[Itl,, confidence: 0.3))
recognized text Optional(RecognizedText(string: 'rtlÈ131, confidence: 0.3))

Even with the ROI set to a specific rectangle normalized to Vision's coordinate space, I get the same results: single characters and gibberish.
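One way to sanity-check the ROI math on its own: Vision's normalized coordinates run from 0 to 1 with the origin in the lower-left corner, so a rectangle measured in pixels from the top-left needs its y axis flipped. Below is a minimal sketch of that conversion. The helper name normalizedLowerLeftRect is hypothetical, the sketch ignores image orientation entirely, and the result would still need to be wrapped in whatever rect type the request's region of interest expects.

```swift
import Foundation

/// Hypothetical helper: converts a pixel rectangle with a top-left origin
/// into normalized coordinates (0...1) with a lower-left origin.
func normalizedLowerLeftRect(
    from pixelRect: CGRect, imageWidth: CGFloat, imageHeight: CGFloat
) -> CGRect {
    let x = pixelRect.minX / imageWidth
    let width = pixelRect.width / imageWidth
    let height = pixelRect.height / imageHeight
    // Flip the y axis: the rect's bottom edge (maxY in top-left space)
    // becomes its origin in lower-left space.
    let y = 1 - pixelRect.maxY / imageHeight
    return CGRect(x: x, y: y, width: width, height: height)
}

// The top-left quarter of a 200 x 100 image maps to
// x = 0, y = 0.5, width = 0.5, height = 0.5.
let roi = normalizedLowerLeftRect(
    from: CGRect(x: 0, y: 0, width: 100, height: 50),
    imageWidth: 200, imageHeight: 100)
print(roi)
```

If the numbers from a check like this look right and the output is still gibberish, the ROI math is probably not the cause.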

Any help would be amazing, thank you.

  1. Am I using the buffer right?
  2. Am I using the new perform(on: CVPixelBuffer) right?
  3. Maybe I didn't set up my camera properly? I can provide more code.
Answered by candyline in 851252022

Ladies and gentlemen, I found the solution. I thought portrait was orientation .up, but this is what the docs had to say: "When the user captures a photo while holding the device in portrait orientation, iOS writes an orientation value of CGImagePropertyOrientation.right." So with a small change to try await textRequest.perform(on: pixelBuffer, orientation: .right), text detection returned the correct results.
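For reference, here is a minimal sketch of the delegate with that change applied. It assumes the device is held in portrait and that the back camera delivers frames in the sensor's landscape orientation, which would explain why Vision saw the page rotated by 90 degrees. The class name OrientedTextReader is hypothetical; RecognizeTextRequest, perform(on:orientation:), and topCandidates(_:) are the same calls used in the question. It also assumes Swift 5 language mode, as stricter concurrency checking may complain about capturing the pixel buffer in a Task.

```swift
import AVFoundation
import ImageIO
import Vision

/// Hypothetical stand-in for the CameraManager delegate in the question.
final class OrientedTextReader: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    // Configure the request once instead of on every frame.
    private let textRequest: RecognizeTextRequest = {
        var request = RecognizeTextRequest()
        request.recognitionLevel = .fast
        request.recognitionLanguages = [Locale.Language(identifier: "en-US")]
        return request
    }()

    func captureOutput(
        _ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        let request = textRequest
        Task {
            do {
                // Assumption: portrait device, so the buffer is tagged as .right.
                let observations = try await request.perform(
                    on: pixelBuffer, orientation: .right)
                for observation in observations {
                    // Unwrap the top candidate so the log shows the string, not Optional(...).
                    guard let candidate = observation.topCandidates(1).first else { continue }
                    print("recognized text \(candidate.string) (confidence \(candidate.confidence))")
                }
            } catch {
                print("Recognition error: \(error.localizedDescription)")
            }
        }
    }
}
```

Hard-coding .right only fits a portrait-held device; if the app supports other orientations, the value passed to perform(on:orientation:) would need to follow the device orientation.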
