I've been looking through Apple's sample code Building a Feature-Rich App for Sports Analysis - https://developer.apple.com/documentation/vision/building_a_feature-rich_app_for_sports_analysis and its associated WWDC video to learn to reason about AVFoundation and VNDetectTrajectoriesRequest - https://developer.apple.com/documentation/vision/vndetecttrajectoriesrequest. My goal is to allow the user to import videos (this part I have working, the user sees a UIDocumentBrowserViewController - https://developer.apple.com/documentation/uikit/uidocumentbrowserviewcontroller, picks a video file, and then a copy is made), but I only want segments of the original video copied where trajectories are detected from a ball moving.
I've tried as best I can to grasp the two parts, at the very least finding where the video copy is made and where the trajectory request is made.
The full video copy happens in CameraViewController.swift (I'm starting with just imported video for now and not reading live from the device's video camera), line 160:func startReadingAsset(_ asset: AVAsset) {
videoRenderView = VideoRenderView(frame: view.bounds)
setupVideoOutputView(videoRenderView)
let displayLink = CADisplayLink(target: self, selector: #selector(handleDisplayLink(:)))
displayLink.preferredFramesPerSecond = 0
displayLink.isPaused = true
displayLink.add(to: RunLoop.current, forMode: .default)
guard let track = asset.tracks(withMediaType: .video).first else {
AppError.display(AppError.videoReadingError(reason: "No video tracks found in AVAsset."), inViewController: self)
return
}
let playerItem = AVPlayerItem(asset: asset)
let player = AVPlayer(playerItem: playerItem)
let settings = [
String(kCVPixelBufferPixelFormatTypeKey): kCVPixelFormatType420YpCbCr8BiPlanarFullRange
]
let output = AVPlayerItemVideoOutput(pixelBufferAttributes: settings)
playerItem.add(output)
player.actionAtItemEnd = .pause
player.play()
self.displayLink = displayLink
self.playerItemOutput = output
self.videoRenderView.player = player
let affineTransform = track.preferredTransform.inverted()
let angleInDegrees = atan2(affineTransform.b, affineTransform.a) * CGFloat(180) / CGFloat.pi
var orientation: UInt32 = 1
switch angleInDegrees {
case 0:
orientation = 1 // Recording button is on the right
case 180, -180:
orientation = 3 // abs(180) degree rotation recording button is on the right
case 90:
orientation = 8 // 90 degree CW rotation recording button is on the top
case -90:
orientation = 6 // 90 degree CCW rotation recording button is on the bottom
default:
orientation = 1
}
videoFileBufferOrientation = CGImagePropertyOrientation(rawValue: orientation)!
videoFileFrameDuration = track.minFrameDuration
displayLink.isPaused = false
}
@objc
private func handleDisplayLink(_ displayLink: CADisplayLink) {
guard let output = playerItemOutput else {
return
}
videoFileReadingQueue.async {
let nextTimeStamp = displayLink.timestamp + displayLink.duration
let itemTime = output.itemTime(forHostTime: nextTimeStamp)
guard output.hasNewPixelBuffer(forItemTime: itemTime) else {
return
}
guard let pixelBuffer = output.copyPixelBuffer(forItemTime: itemTime, itemTimeForDisplay: nil) else {
return
}
// Create sample buffer from pixel buffer
var sampleBuffer: CMSampleBuffer?
var formatDescription: CMVideoFormatDescription?
CMVideoFormatDescriptionCreateForImageBuffer(allocator: nil, imageBuffer: pixelBuffer, formatDescriptionOut: &formatDescription)
let duration = self.videoFileFrameDuration
var timingInfo = CMSampleTimingInfo(duration: duration, presentationTimeStamp: itemTime, decodeTimeStamp: itemTime)
CMSampleBufferCreateForImageBuffer(allocator: nil,
imageBuffer: pixelBuffer,
dataReady: true,
makeDataReadyCallback: nil,
refcon: nil,
formatDescription: formatDescription!,
sampleTiming: &timingInfo,
sampleBufferOut: &sampleBuffer)
if let sampleBuffer = sampleBuffer {
self.outputDelegate?.cameraViewController(self, didReceiveBuffer: sampleBuffer, orientation: self.videoFileBufferOrientation)
DispatchQueue.main.async {
let stateMachine = self.gameManager.stateMachine
if stateMachine.currentState is GameManager.SetupCameraState {
// Once we received first buffer we are ready to proceed to the next state
stateMachine.enter(GameManager.DetectingBoardState.self)
}
}
}
}
}
Line 139 self.outputDelegate?.cameraViewController(self, didReceiveBuffer: sampleBuffer, orientation: self.videoFileBufferOrientation) is where the video sample buffer is passed to the Vision framework subsystem for analyzing trajectories, the second part. This delegate callback is implemented in GameViewController.swift on line 335:
// Perform the trajectory request in a separate dispatch queue.
trajectoryQueue.async {
do {
try visionHandler.perform([self.detectTrajectoryRequest])
if let results = self.detectTrajectoryRequest.results {
DispatchQueue.main.async {
self.processTrajectoryObservations(controller, results)
}
}
} catch {
AppError.display(error, inViewController: self)
}
}
Trajectories found are drawn over the video in self.processTrajectoryObservations(controller, results).
Where I'm stuck now is modifying this so that instead of drawing the trajectories, the new video only copies parts of the original video to it where trajectories were detected in the frame.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
What might be a good way to constrain a view's top anchor to be just at the edge of a device's Face ID sensor housing if it has one?
This view is a product photo that would be clipped too much if it ignored the top safe area inset, but if it was positioned relative to the top safe area margin this wouldn't be ideal either because of the slight gap between the sensor housing and the view (the view is a photo of pants cropped at the waist). What might be a good approach here?
I have an app that currently depends on fetching the model through CloudKit, and is composed of value types. I'm considering adding Core Data support so that record modifications are robust regardless of network conditions.
Core Data resources seem to always assume a model layer with reference semantics, so I'm not sure where to begin.
Should I keep my top-level model type a struct? Can I? If I move my model to reference semantics, how might I bridge from past model instances that are fetched through CloudKit and then decoded?
Thank you in advance.
I am saving time ranges from an input video asset where trajectories are found, then exporting only those segments to an output video file.
Currently I track these time ranges in a stored property var timeRangesOfInterest: [Double : CMTimeRange], which is set in the trajectory request's completion handler
func completionHandler(request: VNRequest, error: Error?) {
guard let request = request as? VNDetectTrajectoriesRequest else { return }
if let results = request.results,
results.count > 0 {
for result in results {
var timeRange = result.timeRange
timeRange.start = timeRange.start - self.assetWriterStartTime
self.timeRangesOfInterest[timeRange.start.seconds] = timeRange
}
}
}
Then these time ranges of interest are used in an export session to only export those segments
/*
Finish writing, and asynchronously evaluate the results from writing
the samples.
*/
assetReaderWriter.writingCompleted { result in
self.exportVideoTimeRanges(timeRanges: self.timeRangesOfInterest.map { $0.1 }) { result in
completionHandler(result)
}
}
Unfortunately however, I'm getting repeated trajectory video segments in the outputted video. Is this maybe because trajectory requests return "in progress" repeated trajectory results with slightly different time range start times? What might be a good strategy for avoiding or removing them? I noticed trajectory segments will appear out of order in the output as well.
I'm building a feature to automatically edit out all the downtime of a tennis video. I have a partial implementation that stores the start and end times of Vision trajectory detections and writes only those segments to an AVFoundation export session.
I've encountered a major issue, which is that the trajectories returned end whenever the ball bounce, so each segment is just one tennis shot and nowhere close to an entire rally with multiple bounces. I'm ensure if I should continue done the trajectory route, maybe stitching together the trajectories and somehow only splitting at the start and end of a rally.
Any general guidance would be appreciated.
Is there a different Vision or ML approach that would more accurately model the start and end time of a rally? I considered creating a custom action classifier to classify frames to be either "playing tennis" or "inactivity," but I started with Apple's trajectory detection since it was already built and trained. Maybe a custom classifier would be needed, but not sure.
My goal is to mark any tennis video's timestamps of both the start of each rally/point and the end of each rally/point. I tried trajectory detection, but the "end time" is when the ball bounces rather than when the rally/point ends. I'm not quite sure what direction to go from here to improve on this. Would action classification of body poses in each frame (two classes, "playing" and "not playing") be the best way to split the video into segments? A different technique?
I followed Apple's guidance in their articles Creating an Action Classifier Model, Gathering Training Videos for an Action Classifier, and Building an Action Classifier Data Source. With this Core ML model file now imported in Xcode, how do use it to classify video frames?
For each video frame I call
do {
let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer)
try requestHandler.perform([self.detectHumanBodyPoseRequest])
} catch {
print("Unable to perform the request: \(error.localizedDescription).")
}
But it's unclear to me how to use the results of the VNDetectHumanBodyPoseRequest which come back as the type [VNHumanBodyPoseObservation]?. How would I feed to the results into my custom classifier, which has an automatically generated model class TennisActionClassifier.swift? The classifier is for making predictions on the frame's body poses, labeling the actions as either playing a rally/point or not playing.
Modifying guidance given in an answer on AVFoundation + Vision trajectory detection, I'm instead saving time ranges of frames that have a specific ML label from my custom action classifier:
private lazy var detectHumanBodyPoseRequest: VNDetectHumanBodyPoseRequest = {
let detectHumanBodyPoseRequest = VNDetectHumanBodyPoseRequest(completionHandler: completionHandler)
return detectHumanBodyPoseRequest
}()
var timeRangesOfInterest: [Int : CMTimeRange] = [:]
private func readingAndWritingDidFinish(assetReaderWriter: AVAssetReaderWriter,
asset
completionHandler: @escaping FinishHandler) {
if isCancelled {
completionHandler(.success(.cancelled))
return
}
// Handle any error during processing of the video.
guard sampleTransferError == nil else {
assetReaderWriter.cancel()
completionHandler(.failure(sampleTransferError!))
return
}
// Evaluate the result reading the samples.
let result = assetReaderWriter.readingCompleted()
if case .failure = result {
completionHandler(result)
return
}
/*
Finish writing, and asynchronously evaluate the results from writing
the samples.
*/
assetReaderWriter.writingCompleted { result in
self.exportVideoTimeRanges(timeRanges: self.timeRangesOfInterest.map { $0.value }) { result in
completionHandler(result)
}
}
}
func exportVideoTimeRanges(timeRanges: [CMTimeRange], completion: @escaping (Result<OperationStatus, Error>) -> Void) {
let inputVideoTrack = self.asset.tracks(withMediaType: .video).first!
let composition = AVMutableComposition()
let compositionTrack = composition.addMutableTrack(withMediaType: .video, preferredTrackID: kCMPersistentTrackID_Invalid)!
var insertionPoint: CMTime = .zero
for timeRange in timeRanges {
try! compositionTrack.insertTimeRange(timeRange, of: inputVideoTrack, at: insertionPoint)
insertionPoint = insertionPoint + timeRange.duration
}
let exportSession = AVAssetExportSession(asset: composition, presetName: AVAssetExportPresetHighestQuality)!
try? FileManager.default.removeItem(at: self.outputURL)
exportSession.outputURL = self.outputURL
exportSession.outputFileType = .mov
exportSession.exportAsynchronously {
var result: Result<OperationStatus, Error>
switch exportSession.status {
case .completed:
result = .success(.completed)
case .cancelled:
result = .success(.cancelled)
case .failed:
// The `error` property is non-nil in the `.failed` status.
result = .failure(exportSession.error!)
default:
fatalError("Unexpected terminal export session status: \(exportSession.status).")
}
print("export finished: \(exportSession.status.rawValue) - \(exportSession.error)")
completion(result)
}
}
This worked fine with results vended from Apple's trajectory detection, but using my custom action classifier TennisActionClassifier (Core ML model exported from Create ML), I get the console error getSubtractiveDecodeDuration signalled err=-16364 (kMediaSampleTimingGeneratorError_InvalidTimeStamp) (Decode timestamp is earlier than previous sample's decode timestamp.) at MediaSampleTimingGenerator.c:180. Why might this be?
After creating a custom action classifier in Create ML, previewing it (see the bottom of the page) with an input video shows the label associated with a segment of the video. What would be a good way to store the duration for a given label, say, each CMTimeRange of segment of video frames that are classified as containing "Jumping Jacks?"
I previously found that storing time ranges of trajectory results was convenient, since each VNTrajectoryObservation vended by Apple had an associated CMTimeRange.
However, using my custom action classifier instead, each VNObservation result's CMTimeRange has a duration value that's always 0.
func completionHandler(request: VNRequest, error: Error?) {
guard let results = request.results as? [VNHumanBodyPoseObservation] else {
return
}
if let result = results.first {
storeObservation(result)
}
do {
for result in results where try self.getLastTennisActionType(from: [result]) == .playing {
var fileRelativeTimeRange = result.timeRange
fileRelativeTimeRange.start = fileRelativeTimeRange.start - self.assetWriterStartTime
self.timeRangesOfInterest[Int(fileRelativeTimeRange.start.seconds)] = fileRelativeTimeRange
}
} catch {
print("Unable to perform the request: \(error.localizedDescription).")
}
}
In this case I'm interested in frames with the label "Playing" and successfully classify them, but I'm not sure where to go from here to track the duration of video segments with consecutive frames that have that label.
I'm having trouble reasoning about and modifying the Detecting Human Actions in a Live Video Feed sample code since I'm new to Combine.
// ---- [MLMultiArray?] -- [MLMultiArray?] ----
// Make an activity prediction from the window.
.map(predictActionWithWindow)
// ---- ActionPrediction -- ActionPrediction ----
// Send the action prediction to the delegate.
.sink(receiveValue: sendPrediction)
These are the final two operators of the video processing pipeline, where the action prediction occurs. In either the implementation for private func predictActionWithWindow(_ currentWindow: [MLMultiArray?]) -> ActionPrediction or for private func sendPrediction(_ actionPrediction: ActionPrediction), how might I access the results of a VNHumanBodyPoseRequest that's retrieved and scoped in a function called earlier in the daisy chain?
When I did this imperatively, I accessed results in the VNDetectHumanBodyPoseRequest completion handler, but I'm not sure how data flow would work with Combine's programming model. I want to associate predictions with the observation results they're based on so that I can store the time range of a given prediction label.
How should I think about video quality (if it's important) when gathering training videos? Does higher video quality of training data make for better predictions, or should it more closely match the common use case (1080p I suppose, thinking about iPhones broadly)?
Would might be a good approach to estimating a VNVideoProcessor operation? I'd like to show a progress bar that's useful enough like one based the progress Apple vends for the photo picker or exports. This would make a world of difference compared to a UIActivityIndicatorView, but I'm not sure how to approach handrolling this (or if that would even be a good idea).
I filed an API enhancement request for this, FB9888210.
For a Create ML activity classifier, I’m classifying “playing” tennis (the points or rallies) and a second class “not playing” to be the negative class. I’m not sure what to specify for the action duration parameter given how variable a tennis point or rally can be, but I went with 10 seconds since it seems like the average duration for both the “playing” and “not playing” labels.
When choosing this parameter however, I’m wondering if it affects performance, both speed of video processing and accuracy. Would the Vision framework return more results with smaller action durations?
My activity classifier is used in tennis sessions, where there are necessarily multiple people on the court. There is also a decent chance other courts' players will be in the shot, depending on the angle and lens.
For my training data, would it be best to crop out adjacent courts?
How do you determine which threads run loop to receive events, in the context of Combine publishers?
This Mac Catalyst tutorial (https://developer.apple.com/tutorials/mac-catalyst/adding-items-to-the-sidebar) shows the following code snippet:
recipeCollectionsSubscriber = dataStore.$collections
.receive(on: RunLoop.main)
.sink { [weak self] _ in
guard let self = self else { return }
let snapshot = self.collectionsSnapshot()
self.dataSource.apply(snapshot, to: .collections, animatingDifferences: true)
}