I am developing an app with transcription and I am exploring ways to improve the transcription from the SpeechAnalyzer/Transcriber for technical terms. SFSpeech... recognition had the capability of being augmented by contextualStrings. Does something similar exist for SpeechAnalyzer/Transcriber? If so please point me towards the documentation and any sample code that may exist for this. If there are other options, please let me know.
Audio
RSS for tagDive into the technical aspects of audio on your device, including codecs, format support, and customization options.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
My app want Converting iphone12 HDR Video to SDR,to edit。
follow the doc Apple-HDR-Convert.
My code setting the pixBuffAttributes
[pixBuffAttributes setObject:(id)(kCVImageBufferYCbCrMatrix_ITU_R_709_2) forKey:(id)kCVImageBufferYCbCrMatrixKey];
[pixBuffAttributes setObject:(id)(kCVImageBufferColorPrimaries_ITU_R_709_2) forKey:(id)kCVImageBufferColorPrimariesKey];
[pixBuffAttributes setObject:(id)kCVImageBufferTransferFunction_ITU_R_709_2 forKey:(id)kCVImageBufferTransferFunctionKey];
playerItemOutput = [[AVPlayerItemVideoOutput alloc] initWithPixelBufferAttributes:pixBuffAttributes];
but I get the playerItemOutput's output buffer
CFTypeRef colorAttachments = CVBufferGetAttachment(pixelBuffer, kCVImageBufferYCbCrMatrixKey, NULL);
CFTypeRef colorPrimaries = CVBufferGetAttachment(pixelBuffer, kCVImageBufferColorPrimariesKey, NULL);
CFTypeRef colorTransFunc = CVBufferGetAttachment(pixelBuffer, kCVImageBufferTransferFunctionKey, NULL);
NSLog(@"colorAttachments = %@", colorAttachments);
NSLog(@"colorPrimaries = %@", colorPrimaries);
NSLog(@"colorTransFunc = %@", colorTransFunc);
log output:
colorAttachments = ITU_R_2020
colorPrimaries = ITU_R_2020
colorTransFunc = ITU_R_2100_HLG
pixBuffAttributes setting output format invalid,please help!
I'm writing a simple app for iOS and I'd like to be able to do some text to speech in it. I have a basic audio manager class with a "speak" function:
import Foundation
import AVFoundation
class AudioManager {
static let shared = AudioManager()
var audioPlayer: AVAudioPlayer?
var isPlaying: Bool {
return audioPlayer?.isPlaying ?? false
}
var playbackPosition: TimeInterval = 0
func playSound(named name: String) {
guard let url = Bundle.main.url(forResource: name, withExtension: "mp3") else {
print("Sound file not found")
return
}
do {
if audioPlayer == nil || !isPlaying {
audioPlayer = try AVAudioPlayer(contentsOf: url)
audioPlayer?.currentTime = playbackPosition
audioPlayer?.prepareToPlay()
audioPlayer?.play()
} else {
print("Sound is already playing")
}
} catch {
print("Error playing sound: \(error.localizedDescription)")
}
}
func stopSound() {
if let player = audioPlayer {
playbackPosition = player.currentTime
player.stop()
}
}
func speak(text: String) {
let synthesizer = AVSpeechSynthesizer()
let utterance = AVSpeechUtterance(string: text)
utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
synthesizer.speak(utterance)
}
}
And my app shows text in a ScrollView:
ScrollView {
Text(self.description)
.padding()
.foregroundColor(.black)
.font(.headline)
.background(Color.gray.opacity(0))
}.onAppear {
AudioManager.shared.speak(text: self.description)
}
However, the text doesn't get read out (in the simulator). I see some output in the console:
Error fetching voices: Swift.DecodingError.dataCorrupted(Swift.DecodingError.Context(codingPath: [], debugDescription: "Invalid container metadata for _UnkeyedDecodingContainer, found keyedGraphEncodingNodeID", underlyingError: nil)). Using fallback voices.
I'm probably doing something wrong here, but not sure what.
Topic:
Media Technologies
SubTopic:
Audio
[Note: this issue was happening on a main testing device, and after testing the same code on other devices, this issue is only happening on 1 out of 4 devices]
We are successfully getting a MusicCatalogResourceResponse for every song ID where we make the MusicCatalogResourceRequest. We are able to display the song title, artist name, and album artwork for each Song in the response.
However - when we go to play the song, there are some songs that play, and several songs that do not play. For the songs that don't play, the console shows “Failed to prepareToPlay error=<MPMusicPlayerControllerErrorDomain.6 "Failed to prepare to play" {}>”
let musicPlayer = ApplicationMusicPlayer.shared
func playSong(_ song: Song) {
musicPlayer.queue = [song]
Task {
try await musicPlayer.prepareToPlay()
try await musicPlayer.play()
}
Is there anything else we can investigate about what may be causing specific song IDs not to play on this specific device?
Even if we remove line 6 musicPlayer.prepareToPlay() we still see the same console error when running playSong with the Songs that don't work.
It is always the same song IDs that we can play and always the same song IDs that we cannot get to play, even trying them across different projects with different bundle identifiers.
We can tap to play a song that works, and it starts playing immediately. Then tap a song that doesn't work, and nothing happens. Then back to a song that works. It's consistent which songs succeed and fail on this device.
Perhaps there is an issue specific to this very iPad when it comes to certain specific songs, but we'd like to be confident that an app relying on MusicKit will be able to play songs that have been successfully loaded with a MusicCatalogResourceResponse.
Thanks for any help or suggestions about what we may be able to investigate further on the device or what we should consider when launching an app that expects anyone with Apple Music to be able to listen to any of the songs loaded by the app.
Specific iPad details: iPad Pro (12.9-inch) (6th generation) running iPadOS 26.4 Beta
Two of the song IDs that won't play on this iPad (even though we can access and display their album artwork and all other information): 943204000 and 1441164805
Is there any feasible way to get a Core Audio device's system effect status (Voice Isolation, Wide Spectrum)?
AVCaptureDevice provides convenience properties for system effects for video devices. I need to get this status for Core Audio input devices.
I'm working on adding CarPlay support to an audio app and am running into an issue. Occasionally, when a user opens the app from CarPlay while the main app scene is either not connected or is currently in the background, I will receive an error when attempting to activate the audio session. The code below mimics my setup:
do {
try AVAudioSession.sharedInstance().setCategory(.playback, mode: .spokenAudio)
try AVAudioSession.sharedInstance().setActive(true)
} catch {
print(error) // NSOSStatusErrorDomain - 560557684: Session activation failed
}
That error code maps to AVAudioSession.ErrorCode.cannotInterruptOthers.
Once in this state, all subsequent attempts to play different pieces of content will fail. However, things will start working normally if the user opens the app on their phone and tries again from CarPlay (while the app is in the foreground on their phone).
I'm not sure why it would behave this way and want to note that I do have the audio background mode capability enabled.
Has anyone else encountered this? Are there any workarounds or changes I could make to prevent this from happening?
Hello,
I have been running into issues with setting nowPlayingInfo information, specifically updating information for CarPlay and the CPNowPlayingTemplate.
When I start playback for an item, I see lock screen information update as expected, along with the CarPlay now playing information.
However, the playing items are books with collections of tracks. When I select a new track(chapter) within the book, I set the MPMediaItemPropertyTitle to the new chapter name. This change is reflected correctly on the lock screen, but almost never appears correctly on the CarPlay CPNowPlayingTemplate. The previous chapter title remains set and never updates.
I see "Application exceeded audio metadata throttle limit." in the debug console fairly frequently.
From that a I figured that I need to minimize updates to the nowPlayingInfo dictionary. What I did:
I store the metadata dictionary in a local dictionary and only set values in the main nowPlayingInfo dictionary when they are different from the current value.
I kick off the nowPlayingInfo update via a task that initially sleeps for around 2 seconds (not a final value, just for my current testing). If a previous Task is active, it gets cancelled, so that only one update can happen within that time window.
Neither of these things have been sufficient. I can switch between different titles entirely and the information updates (including cover art).
But when I switch chapters within a title, the MPMediaItemPropertyTitle continues to get dropped. I know the value is getting set, because it updates on the lock screen correctly.
In total, I have 12 keys I update for info, though with the above changes, usually 2-4 of them actually get updated with high frequency.
I am running out of ideas to satisfy the throttling thresholds to accurately display metadata. I could use some advice.
Thanks.
Please include the line below in follow-up emails for this request.
Case-ID: 11089799
When using AVSpeechUtterance and setting it to play in Mandarin, if Siri is set to Cantonese on iOS 18, it will be played in Cantonese. There is no such issue on iOS 17 and 16.
1.let utterance = AVSpeechUtterance(string: textView.text)
let voice = AVSpeechSynthesisVoice(language: "zh-CN")
utterance.voice = voice
2.In the phone settings, Siri is set to Cantonese
In my app I use AVAssetReaderTrackOutput to extract PCM audio from a user-provided video or audio file and display it as a waveform.
Recently a user reported that the waveform is not in sync with his video, and after receiving the video I noticed that the waveform is in fact double as long as the video duration, i.e. it shows the audio in slow-motion, so to speak.
Until now I was using
CMFormatDescription.audioStreamBasicDescription.mSampleRate
which for this particular user video returns 22'050. But in this case it seems that this value is wrong... because the audio file has two audio channels with different sample rates, as returned by
CMFormatDescription.audioFormatList.map({ $0.mASBD.mSampleRate })
The first channel has a sample rate of 44'100, the second one 22'050. If I use the first sample rate, the waveform is perfectly in sync with the video.
The problem is given by the fact that the ratio between the audio data length and the sample rate multiplied by the audio duration is 8, double the ratio for the first audio file (4). In the code below this ratio is given by
Double(length) / (sampleRate * asset.duration.seconds)
When commenting out the line with the sampleRate variable definition in the code below and uncommenting the following line, the ratios for both audio files are 4, which is the expected result. I would expect audioStreamBasicDescription to return the correct sample rate, i.e. the one used by AVAssetReaderTrackOutput, which (I think) somehow merges the stereo tracks. The documentation is sparse, and in particular it’s not documented whether the lower or higher sample rate is used; in this case, it seems like the higher one is used, but audioStreamBasicDescription for some reason returns the lower one.
Does anybody know why this is the case or how I should extract the sample rate of the produced PCM audio data? Should I always take the higher one?
I created FB19620455.
let openPanel = NSOpenPanel()
openPanel.allowedContentTypes = [.audiovisualContent]
openPanel.runModal()
let url = openPanel.urls[0]
let asset = AVURLAsset(url: url)
let assetTrack = asset.tracks(withMediaType: .audio)[0]
let assetReader = try! AVAssetReader(asset: asset)
let readerOutput = AVAssetReaderTrackOutput(track: assetTrack, outputSettings: [AVFormatIDKey: Int(kAudioFormatLinearPCM), AVLinearPCMBitDepthKey: 16, AVLinearPCMIsBigEndianKey: false, AVLinearPCMIsFloatKey: false, AVLinearPCMIsNonInterleaved: false])
readerOutput.alwaysCopiesSampleData = false
assetReader.add(readerOutput)
let formatDescriptions = assetTrack.formatDescriptions as! [CMFormatDescription]
let sampleRate = formatDescriptions[0].audioStreamBasicDescription!.mSampleRate
//let sampleRate = formatDescriptions[0].audioFormatList.map({ $0.mASBD.mSampleRate }).max()!
print(formatDescriptions[0].audioStreamBasicDescription!.mSampleRate)
print(formatDescriptions[0].audioFormatList.map({ $0.mASBD.mSampleRate }))
if !assetReader.startReading() {
preconditionFailure()
}
var length = 0
while assetReader.status == .reading {
guard let sampleBuffer = readerOutput.copyNextSampleBuffer(), let blockBuffer = sampleBuffer.dataBuffer else {
break
}
length += blockBuffer.dataLength
}
print(Double(length) / (sampleRate * asset.duration.seconds))
I’m using the shared instance of AVAudioSession. After activating it with .setActive(true), I observe the outputVolume, and it correctly reports the device’s volume.
However, after deactivating the session using .setActive(false), changing the volume, and then reactivating it again, the outputVolume returns the previous volume (before deactivation), not the current device volume. The correct volume is only reported after the user manually changes it again using physical buttons or Control Center, which triggers the observer.
What I need is a way to retrieve the actual current device volume immediately after reactivating the audio session, even on the second and subsequent activations.
Disabling and re-enabling the audio session is essential to how my application functions.
I’ve tested this behavior with my colleagues, and the issue is consistently reproducible on iOS 18.0.1, iOS 18.1, iOS 18.3, iOS 18.5 and iOS 18.6.2. On devices running iOS 17.6.1 and iOS 16.0.3, outputVolume correctly reflects the current volume immediately after calling .setActive(true) multiple times.
I'm working on a project to support spatial audio editing, using this sample project as a reference: https://developer.apple.com/documentation/Cinematic/editing-spatial-audio-with-an-audio-mix
This sample works well on an unedited capture, but does not work for a capture that has already been edited.
The failure is occurring at "let audioInfo = try await CNAssetSpatialAudioInfo(asset: myAsset)", which is throwing "no eligible audio tracks in asset".
I also find that for already edited captures, if i use CNAssetSpatialAudioInfo.assetContainsSpatialAudio, it returns false.
What i mean by "already edited" is that if I take a spatial capture with my iPhone 16, and then edit that capture in the Photos app using the Cinematic effect, and then save the edited output (e.g. edited_capture.mov), I can't import that edited_capture.mov into my project as a spatial audio asset.
Is this intentional behavior or a bug?
If it's intentional, can you describe why?
Topic:
Media Technologies
SubTopic:
Audio
Hello,
I have a CarPlay Navigation app and utilize the AVSpeechSynthesizer to speak directions to a user. Everything works great on my CarPlay simulator as well as when plugged into my GMC truck. However, I found out yesterday that one of my users with a Ford truck the audio would cut in an out.
After much troubleshooting, I was able to replicate this on my own truck when using Bluetooth to connect to CarPlay. My user was also utilizing Bluetooth. Has anyone else experienced this? Is there a fix to the problem?
import SwiftUI
import AVFoundation
class TextToSpeechService: NSObject, ObservableObject, AVSpeechSynthesizerDelegate {
private var speechSynthesizer = AVSpeechSynthesizer()
static let shared = TextToSpeechService()
override init() {
super.init()
speechSynthesizer.delegate = self
}
func configureAudioSession() {
speechSynthesizer.delegate = self
do {
try AVAudioSession.sharedInstance().setCategory(.playback, mode: .voicePrompt, options: [.mixWithOthers, .allowBluetooth])
} catch {
print("Failed to set audio session category: \(error.localizedDescription)")
}
}
func speak(_ text: String) {
Task(priority: .high) {
let speechUtterance = AVSpeechUtterance(string: text)
speechUtterance.voice = AVSpeechSynthesisVoice(language: AVSpeechSynthesisVoice.currentLanguageCode())
try AVAudioSession.sharedInstance().setActive(true, options: .notifyOthersOnDeactivation)
speechSynthesizer.speak(speechUtterance)
}
}
func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance) {
Task {
stopSpeech()
try AVAudioSession.sharedInstance().setActive(false)
}
}
func stopSpeech() {
speechSynthesizer.stopSpeaking(at: .immediate)
}
}
Description: I have identified a specific issue when recording acoustic guitar and other instruments on the iPhone 17 Pro Max using native applications (Voice Memos, Camera). The recordings contain an unnatural metallic resonance (ringing artifacts) that should not be present.
Testing and Methodology:
Hardware Verification: Initially, I suspected a hardware defect in the audio chip or microphone. However, extensive testing with third-party software suggests this is likely a software-level issue.
AudioShare Test: I conducted a test using the AudioShare app in "Measurement Mode" (which bypasses standard iOS system-wide audio processing). In this mode, the audio remains perfectly clean, and the metallic ringing disappears entirely.
Conclusion: The issue is rooted in the DSP (Digital Signal Processing) algorithms that iOS applies for noise suppression or voice enhancement. These algorithms appear to misinterpret the high-frequency overtones of acoustic instruments as background noise and attempt to "filter" them, resulting in audible digital artifacts.
Comparison Results: This issue has not been observed on devices from other brands or on older iPhone models (preliminary tests suggest older versions handle this better). Notably, the problem persists even in GarageBand, as the app still utilizes certain system-level processing layers.
Proposed Solution: I suggest adding a "Raw Audio" or "Instrument Mode" toggle within the Microphone/Audio settings for native apps. This mode should disable aggressive DSP processing, similar to how the AVAudioSession.Mode.measurement works in specialized apps.
Attachments: I am attaching 4 archives, including a final "Measurement Mode" folder with comparative samples (Measurement Mode vs. Standard Mode). The artifacts are most prominent when monitored through headphones.
using iOS 26.2; Airpods 4
Long press stem to launch Siri
Speak "Record Voice Memo" -> Recording starts
Recording in progress...
Long press stem to launch Siri -> Nothing happens.
To stop recording need use phone.
is this intended behaviour?
i would like to be able to stop recording with Siri
I am able to launch Siri from phone while recording, but point is to keep phone in pocket and start/stop recordings only via Airpods.
Hi, when using ApplicationMusicPlayer from MusicKit my app automatically gets the media controls on the lock screen: Play/ Pause, Skip Buttons, Playback Position etc.
I would like to customize these. Tried a bunch of things, e.g. using MPRemoteCommandCenter. So far I haven't had any success.
Does anyone know how I can customize the media controls of ApplicationMusicPlayer.
Thank you.
Please consider adding the ability to programatically download Premium and Enhanced voices. At the moment it is extremely inconvenient for our users, as they have to navigate to settings themselves to download voices. Our app relies heavily on SpeechSynthesis integration, and it would greatly benefit from this feature.
FB16307193
Hi,
I’m an iOS developer building an app with an use case that needs advanced playback on Apple Music subscription streams, specifically:
• Real-time tempo change (BPM) during playback — i.e., time-stretch with key-lock, not just crossfade.
• Beat-matched transitions between tracks.
From what I can tell, this capability seems to exist only for approved partners and isn’t available through public MusicKit.
Question: What’s the official request path to be evaluated for that restricted partner entitlement (application form, questionnaire, NDA, or internal team/BD contact)? If the entitlement identifier is internal, how can I get my account routed to the right Apple Music team?
For reference, publicly announced partners include Algoriddim djay, Serato DJ Pro, rekordbox (AlphaTheta), and Engine DJ—all of which appear to implement mixing features that imply advanced playback (tempo/beat-matching) on Apple Music content. I’d prefer not to share product details publicly for the moment and can provide specifics privately if needed.
Thanks in advance!
Topic:
Media Technologies
SubTopic:
Audio
Tags:
Apple Music API
FairPlay Streaming
MusicKit
AVFoundation
I've been wondering if there is a way to modify or even disable tones for indicating channel states. The behaviour regarding tones seems like a black box with little documentation.
During migration to Apple's PT Framework we've noticed that there are few scenarios where a tone is played which doesn't match certain certifications. For example; moving from a channel to another produces a tone which would fail a test case. I understand the reasoning fully, as it marks that the channel is ready to transmit or receive, but this doesn't mirror the behaviour of TETRA which would be wanted in this case.
I'm also wondering if there would be any way to directly communicate feedback regarding PT Framework?
Hello,
I have an iOS app that is recording audio that is working fine on iPads/iPhones. It asks for microphone permission and after that recording works.
I installed the same app on my M3 MacBook via TestFlight since iPad apps are supposed to work without a change that way. The app starts fine and everything, but it never asks for Microphone permission, so I can't record.
Do I need to do something to make this happen (this is not macCatalyst, its running the arm64 iPhone binary on macOS)
thanks
Hello,
We are developing a real-time speech recognition application and are utilizing AVAudioEngine with voice processing enabled on the input node. However, we have observed that enabling this mode interferes with the built-in iOS screen recording feature - specifically, the recorded video does not capture any audio when this mode is active.
Since we want users to be able to record their experience within our app, this issue significantly impacts our functionality. Is there a known workaround or recommended approach to ensure that both voice processing and screen recording can function simultaneously?
Any guidance would be greatly appreciated.
Thank you!