Does Core Spotlight work with document-based apps?

I have a SwiftUI document-based app that for the sake of this discussion stores accounting information: chart of accounts, transactions, etc. Each document is backed by a SwiftData DB.

I'd like to incorporate search into the app so that users can find transactions matching certain criteria, so I went to Core Spotlight. Indexing & search within the app seem to work well.

The issue is that Spotlight APIs appear to be App based & not Document based. I can't find a way to separate Spotlight data by document.

I've tried having each document maintain a UUID as a document-specific identifier and include the identifier in every CSSearchableItem. When performing a query I filter the results with CSUserQueryContext.filterQueries that filter by the document identifier. That works to limit results to the specific file for search operations.

Index updates via CSSearchableIndexDelegate.reindex* methods seem to be App-centric. A user may have file #1 open, but the delegate is being asked to update CSSearchableItems for IDs in other files.

  1. Is there a proper way to use Spotlight for in-app search with a document-based app?
  2. Is there a way to keep Spotlight-indexed data local within the app & not make it available across the system? I.e. I'd like to search within the app only. System-level searches should not surface this data.
Answered by DTS Engineer in 845631022

The issue is that Spotlight APIs appear to be App based & not Document based.

Sort of. I think the better way to understand this is that the API was intentionally broadened to cover non-document data, but that shift also makes the API appear more "app based".

I can't find a way to separate Spotlight data by document.

Note the "contentURL" property of CSSearchableItemAttributeSet, which is how you'd note the document location. So you'll end up creating multiple CSSearchableItems for every document, all of which (for a given document) will have the same content URL.
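A minimal sketch of what that might look like, assuming one searchable item per transaction. The function name, the `transactionID` parameter, and the choice of `.content` as the content type are placeholders for illustration, not anything from the original post.

```swift
import CoreSpotlight
import UniformTypeIdentifiers

/// Builds one searchable item for a record stored inside a document.
/// Every item from the same document carries the same contentURL.
/// (Hypothetical helper; the names are illustrative only.)
func makeSearchableItem(transactionID: String,
                        title: String,
                        documentURL: URL) -> CSSearchableItem {
    let attributes = CSSearchableItemAttributeSet(contentType: .content)
    attributes.title = title
    // Records which document file this record lives in.
    attributes.contentURL = documentURL
    return CSSearchableItem(uniqueIdentifier: transactionID,
                            domainIdentifier: nil,
                            attributeSet: attributes)
}
```

Calling this once per record with the same `documentURL` gives the "many items, one content URL" layout described above.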

I've tried having each document maintain a UUID as a document-specific identifier and include the identifier in every CSSearchableItem. When performing a query I filter the results with CSUserQueryContext.filterQueries that filter by the document identifier. That works to limit results to the specific file for search operations.

My guess here is that this is fairly slow, because you're basically searching "everything" and then discarding results "down" to the target file. I think you'll find that including contentURL as part of the initial query makes things significantly faster.

Index updates via CSSearchableIndexDelegate.reindex* methods seem to be App-centric. A user may have file #1 open, but the delegate is being asked to update CSSearchableItems for IDs in other files.

This depends on what you ask it to do. reindexSearchableItemsWithIdentifiers should only index items with that identifier which you, presumably, have constrained to your file. However, the other option is to more "manually" update the index by using deleteSearchableItems(withIdentifiers:completionHandler:) to delete stale data and/or indexSearchableItems(_:completionHandler:) to add/update new/existing data.

Finally, some general comments on this kind of situation:

I have a SwiftUI document-based app that for the sake of this discussion stores accounting information: chart of accounts, transactions, etc. Each document is backed by a SwiftData DB.

Keep in mind that CoreSpotlight's larger goal is to make your app's content more broadly searchable by the larger system, NOT (necessarily) to serve as the "full" search system for all of your content. In more concrete terms, I think there's a spectrum that runs from something like:

  • A music player, where the metadata of each is well "known" and everything the user can search their library for in-app might reasonably be available system wide.

  • A spreadsheet, where a user might want to search for "22.43" but is unlikely to search for that system wide.

The custom data field system does allow (I think) you to use CoreSpotlight as the index system for data you don't actually need/want the system to search, however, the right approach here depends on the nature of your app and the data you're working with. As a contrived example, text editors should probably not try and implement arbitrary text search on "top" of CoreSpotlight.

Also, keep in mind that being document based does create edge cases which are easy to overlook. Case in point, this doesn't actually work:

I've tried having each document maintain a UUID as a document-specific identifier and include the identifier in every CSSearchableItem.

...since the user can arbitrarily duplicate the same file. In the worst case, you could conceivably end up searching an older/newer copy of the file you've opened (because the other file was indexed and the current one wasn't). Similarly, you can also end up with documents with the same UUID having totally different content. contentURL lets you avoid that (by constraining you to the relevant file), but that also means you may need to index a file you just opened.

I'm not aware of any solution that will REALLY address all of these issues, it's more a matter of working through the edge cases and finding a solution that's "reasonable".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I think you'll find that including contentURL as part of the initial query makes things significantly faster.

Unfortunately CSUserQuery does not allow these types of query strings. CSUserQueryContext.filterQueries appears to be the only way to scope to contentURL.
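For reference, scoping through filterQueries might be sketched roughly like this. The helper names are hypothetical, and the filter string rests on two assumptions: that `contentURL` can be compared against the URL's absolute string, and that backslash-escaping is enough for the quoted value. Neither is confirmed in this thread.

```swift
import CoreSpotlight

/// Backslash-escapes characters that are special inside a quoted query value.
/// (Assumption: escaping backslash, quotes, and "*" is sufficient.)
func escapedForSpotlightQuery(_ value: String) -> String {
    var result = ""
    for character in value {
        if character == "\\" || character == "\"" || character == "'" || character == "*" {
            result.append("\\")
        }
        result.append(character)
    }
    return result
}

/// Builds a filter string matching items whose contentURL equals `url`.
/// (Assumption: the attribute is compared against the absolute string.)
func documentFilterQuery(for url: URL) -> String {
    "contentURL == \"\(escapedForSpotlightQuery(url.absoluteString))\""
}

/// Wraps the user's search text in a query limited to a single document.
@available(macOS 13.0, iOS 16.0, *)
func makeDocumentScopedQuery(_ userText: String, documentURL: URL) -> CSUserQuery {
    let context = CSUserQueryContext()
    context.fetchAttributes = ["title"]
    context.filterQueries = [documentFilterQuery(for: documentURL)]
    return CSUserQuery(userQueryString: userText, userQueryContext: context)
}
```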

The issue is that Spotlight APIs appear to be App based & not Document based.

Imagine that a user has created 2 documents, #1 & #2, but only #1 is currently open.

The CSSearchableIndexDelegate.reindex* methods may be called by the system with identifiers from both files, but the app only has access to the DB for the open file.

  1. What is the correct way to handle this case? Only update the items for which #1 contains the identifiers? This will leave the indexed items for #2 in an invalid state. Should the acknowledgementHandler be called?

  2. Should the acknowledgementHandler be called if an error is encountered by a reindex* method?

  3. Swift 6.2, Swift 6 language mode: the reindex* methods call indexSearchableItems which in their callbacks need to invoke the acknowledgementHandler. However the callback is @Sendable but the acknowledgementHandler is not.

searchableIndex.indexSearchableItems(items) { error in
  if let error { … }
  acknowledgementHandler()
}

Capture of 'acknowledgementHandler' with non-sendable type '() -> Void' in a '@Sendable' closure

Unfortunately CSUserQuery does not allow these types of query strings. CSUserQueryContext.filterQueries appears to be the only way to scope to contentURL.

That's unfortunate. Well, looking at our code again, I think the performance loss will be smaller than I'd thought, as filterQueries is actually passed over to and processed on the daemon side.

The CSSearchableIndexDelegate.reindex* methods may be called by the system with identifiers from both files, but the app only has access to the DB for the open file.

Actually, this reminded me of something I'd overlooked. The correct answer is to create a CSImportExtension. That extension point hands you the file you're indexing, side stepping the entire issue. FYI, this extension point has been broken on previous macOS versions, but does appear to work on macOS 15.

  3. Swift 6.2, Swift 6 language mode: the reindex* methods call indexSearchableItems which in their callbacks need to invoke the acknowledgementHandler. However the callback is @Sendable but the acknowledgementHandler is not.

Anytime you run into this, the answer is to file a bug. While our framework should fully support Swift 6, that certainly isn't true today.
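Until the annotations are fixed, one stopgap that is sometimes used for this kind of diagnostic is to wrap the handler in an `@unchecked Sendable` box. This is only a sketch, not a recommendation from this thread: `UncheckedSendableBox` and `performThenAcknowledge` are hypothetical names, the `work` parameter stands in for a call like indexSearchableItems, and the wrapper silences the compiler without making the closure any safer, so it assumes the handler really is safe to call from the completion callback.

```swift
/// Tells the compiler to trust us that `value` may cross concurrency domains.
/// (Hypothetical helper; this opts out of checking, it does not add safety.)
struct UncheckedSendableBox<Value>: @unchecked Sendable {
    let value: Value
}

/// Runs `work`, then calls the acknowledgement handler from its
/// @Sendable completion callback, whether or not an error occurred.
func performThenAcknowledge(
    _ work: (@escaping @Sendable (Error?) -> Void) -> Void,
    acknowledgementHandler: @escaping () -> Void
) {
    let acknowledge = UncheckedSendableBox(value: acknowledgementHandler)
    work { error in
        if let error { print("Indexing failed: \(error)") }
        acknowledge.value()
    }
}
```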

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

filterQueries is actually being passed over and processed on the daemon side

Excellent.

The correct answer is to create a CSImportExtension. That extension point hands you the file you're indexing, side stepping the entire issue.

Looking at the documentation for CSImportExtension it appears that this is for gathering & updating information about a file, not the items in the file. The extension only gets one CSSearchableItemAttributeSet that is for metadata about the file.

An object that provides searchable attributes for file types that the app supports.

Even if I wanted to misuse the API just to trigger a full reindex of the file contents, due to sandbox restrictions I don't think it would be possible to open an arbitrary file. The user would have to select the file in an Open dialog.

My current solution is:

  1. When a document is opened start a Task that gathers all of the uniqueIdentifiers of the searchableItems for this file from the search index (using the contentURL to filter the results).
  2. Gather the model identifiers from SwiftData.
  3. Compare the 2 sets of identifiers. If the sets are not identical then index/delete the appropriate searchableItems (or reindex the whole file).

This isn't the most efficient solution, but the files I'm dealing with likely contain no more than several thousand items.
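The comparison in step 3 can be sketched as plain set arithmetic. `IndexDelta` and `reconcile` are illustrative names, and the identifiers are assumed to be strings. Note that this only catches records missing from or left over in the index, not records whose contents changed.

```swift
/// The work needed to bring the search index in line with the model.
struct IndexDelta: Equatable {
    var toIndex: Set<String>   // in the model, missing from the index
    var toDelete: Set<String>  // in the index, no longer in the model
}

/// Compares the identifiers found in the index with those in the data store.
func reconcile(indexed: Set<String>, model: Set<String>) -> IndexDelta {
    IndexDelta(toIndex: model.subtracting(indexed),
               toDelete: indexed.subtracting(model))
}
```

If both sets in the result are empty, the identifiers match and nothing needs to be added or removed.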

This leaves me with the outstanding question relating to the CSSearchableIndexDelegate.reindex* methods. Should the acknowledgementHandler be called when an error is encountered? The documentation says:

The delegate … must call the acknowledgement handler after all client state information has been saved, so that the indexer can call this method again in case of a crash.

It doesn't describe what to do if the client state couldn't be saved. In my current solution I always call the acknowledgementHandler.

So, before everything else, I think the big question that needs to be clarified here is how exactly the user is most likely going to interact with your app and documents. In particular, questions like:

  • How many documents is a "typical" user likely to have and how long is that document likely to "last"?

  • How should the results of searching at the system level compare to searching at the app level?

Looking at the documentation for CSImportExtension, it appears that this is for gathering & updating information about a file, not the items in the file. The extension only gets one CSSearchableItemAttributeSet that is for metadata about the file.

I think we need to step back and approach this in terms of two separate problems:

  1. System-wide search

  2. In-app search

As far as a system-wide search is concerned, I think you need to use CSImportExtension. The reason for that is actually this:

The extension only gets one CSSearchableItemAttributeSet that is for metadata about the file.

Files come and go all the time, and Spotlight needs to be able to account for that within its index without involving your app (note there isn't any "un-index" method/delegate). That's why it only uses a single "CSSearchableItemAttributeSet" object— it's associating that object with "the file" and managing it as such.

For background context, you may want to take a look at the "Core Data Spotlight Integration Programming Guide". The divide between the document-based import (CSImportExtension) and app-based import (CSIndexExtensionRequestHandler) is the natural consequence of architectures described in that. The section on "Record-Level Indexing" also explicitly warns against using that architecture in document-based apps, primarily because of the issue I noted above:

You can create Spotlight indexes where each record is indexed individually. This feature is only supported in non-document-based applications. For document-based applications, you should use store metadata as described in Store Metadata-Level Indexing.

Shifting the focus to #2:

An object that provides searchable attributes for file types that the app supports.

Even if I wanted to misuse the API just to trigger a full reindex of the file contents,

So, the first thing here is that, if you're creating your own "database-style" file format, I'd avoid thinking of indexing as a standalone operation that's independent from your app's normal interaction with the file. Your app already has the file open as it's modifying/manipulating it, and it's easier for your app to maintain its "index state" as part of its own normal manipulation of that file.

The issue here is that, no matter what you do, there's no way to avoid the situation where your app is going to be presented with a file that it's never seen before and which is not indexed*. Given that reality, I think the best approach is to create a solution which works well for that scenario, at which point the simplest approach is to just use that solution everywhere.

*For example, the user mounts an smb volume and immediately opens a file from it.

My current solution is:

So, first off, this is the most critical point to design around:

This isn't the most efficient solution, but the files I'm dealing with likely contain no more than several thousand items.

The biggest issue here is "how long does this take", both in terms of normal usage and the worst case. All of this is much easier to design around if the time required is kept very short, and my own intuition is to just optimize as necessary to ensure that it always IS "short".

Building on that point, the big issue here is what to do when a file you've already seen is opened again. I'm very nervous about this kind of approach:

Compare the 2 sets of identifiers. If the sets are not identical then index/delete the appropriate searchableItems (or reindex the whole file).

The problem with this kind of "reconciliation" is that:

  • It tends to be slow.

  • If anything goes wrong, you're probably going to end up having to do a full reindex.

  • As your app evolves over time, there's an increasing risk of missing "something" and ending up in a situation where some data is not indexed or removed.

What I think I'd do instead is to start by having every file embed a UUID, which is then used to track "that" file’s CSSearchableIndex. Then, whenever a new file is opened, I would either:

  1. Delete the entire index contents and reindex every time it's opened.

  2. Validate that the index is still valid for that file and discard and regenerate it if the validation fails.

That "validation" process can be as complicated as you want to make it, but what I'd probably do is have a designated entry in the data store which changes anytime the file is modified, which is also stored in the CSSearchableIndex. If the two values match, then reuse the index. If they don't, throw the index away.
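As a rough sketch of that check, assuming the "designated entry" is a UUID that is regenerated on every modification: `ChangeStamp`, `IndexDecision`, and `decide` are hypothetical names, and how the token is stored in and read back from the index is left open here.

```swift
import Foundation

/// A value stored in the document that changes whenever the document does.
struct ChangeStamp {
    private(set) var token = UUID()
    mutating func recordModification() { token = UUID() }
}

enum IndexDecision: Equatable {
    case reuseExistingIndex
    case discardAndRebuild
}

/// Compares the document's token with the copy saved alongside the index.
/// A missing indexed token (never indexed) also forces a rebuild.
func decide(documentToken: UUID, indexedToken: UUID?) -> IndexDecision {
    indexedToken == documentToken ? .reuseExistingIndex : .discardAndRebuild
}
```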

There are other edge cases this approach doesn't entirely address, like:

  • Cleaning up old indexes in cases where the file is never opened again.

  • Multiple opens of the same/similar file (duplicate UUIDs).

That then leads to here:

This leaves me with the outstanding question relating to the CSSearchableIndexDelegate.reindex* methods. Should the acknowledgementHandler be called when an error is encountered?

A few points here:

  • In the architecture I've laid out, you would not implement CSIndexExtensionRequestHandler, only CSImportExtension. As I talked about above, I don't think CSIndexExtensionRequestHandler can really work in a document-based app.

  • When running in your app with the approach above, I don't think you really have that many failure cases. You'd only have CSSearchableIndex objects for the files you actually had open, which makes failures less likely/common.

Returning to here:

Should the acknowledgementHandler be called when an error is encountered?

Again, this API was built for the "app" context, so there isn't really any way to "fail". The app is being told to index whatever data it wants and to tell the system when it's "done". In concrete terms, you're asking the question "what should I do if I'm trying to index a file I can't read", but the "app context" already implies that your app has full access to "all" of the data it wants to index. In the context of what I outlined above, I would call the acknowledgementHandler (basically, to keep the API contract "clean"). However, I don't think reindexing would really be occurring either.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you again for the thoughtful response. This is the high level solution I've ended up with.

To recap:

  • This is a SwiftUI document-based app: DocumentGroup backed by SwiftData.
  • A user will typically have only a few of these files.
  • A file will typically have only a few thousand indexed records.
  • Search is required for in-app use only; system-wide search is not.

The application maintains a single search index.

  • When the app launches it deletes all items from the search index. This keeps the search index clean in the case of renamed, deleted or seldom used files.

A document uses the domainIdentifier search attribute set to its URL to tag all of its searchable items.

  • When a document is opened it deletes all of its searchable items using the domainIdentifier to scope the operation. It then adds all of its searchableItems to the index.
  • The document adds, deletes & updates its searchable items while it is open.
  • When a document's URL changes it updates the domainIdentifier of its searchable items.
  • In-app search queries are always scoped to the domainIdentifier of the file.
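In code, the open-time step might look roughly like this. It is a sketch only: `tagging` and `rebuildIndex` are illustrative names, error handling is left to the caller, and it has not been checked against Swift 6 strict concurrency.

```swift
import CoreSpotlight

/// Stamps every item with the document's domain identifier.
func tagging(_ items: [CSSearchableItem], withDomain domain: String) -> [CSSearchableItem] {
    for item in items { item.domainIdentifier = domain }
    return items
}

/// Drops whatever the index holds for this document, then adds the
/// document's current items back under the same domain identifier.
func rebuildIndex(forDomain domain: String,
                  items: [CSSearchableItem],
                  in index: CSSearchableIndex = .default()) async throws {
    try await index.deleteSearchableItems(withDomainIdentifiers: [domain])
    try await index.indexSearchableItems(tagging(items, withDomain: domain))
}
```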

Thank you again for the thoughtful response. This is the high-level solution I've ended up with.

On the whole, I think this all sounds reasonable. Throwing out a few potential tweaks/details:

Search is required for in-app use only; system-wide search is not.

If system-wide search would be "a nice-to-have", then I suspect you could pull out the "important" parts (meaning, the details a user might actually search for system-wide) of your app-level indexer and use that to implement CSImportExtension.

When a document's URL changes, it updates the domainIdentifier of its searchable items.

When you're dealing with long-term file references, you need to think "bookmarks", NOT URLs. That's partly because of security scope, but it's mostly because, in practice, file paths have one of the characteristic "patterns" that is FABULOUS at creating bugs. That is:

  • On any given system, file locations and general "configuration" are stable enough that the full path to any given location tends to not change very often.

  • In real-world usage, paths are unstable enough that they can and will change, often for reasons that are not obvious to the user.

This dynamic makes it very easy to build something that works great on your machine and even for some large number of testers, but then ends up breaking badly as your user base broadens.

I think my suggestion here would actually be that you just generate a UUID for every document you open and use that as the domainIdentifier. Note that this also covers edge cases like the user doing the following while your app has "DocA" and "DocB" open:

  • Rename "DocA" -> "DocTemp"
  • Rename "DocB" -> "DocA"
  • Rename "DocTemp" -> "DocB"

As far as search is concerned, all of that was irrelevant. The UUID never changed, and all your app needs to worry about is the mapping between your document UUID and the particular file you have open, something the system largely handles for you.

That's also an option if you want to think about this again:

A user will typically have only a few of these files.

OK. So, one of the key design choices here is "do I try and track all my files and update/preserve my index based on that" and, obviously, fewer files make that more practical.

If you want to go that route, what I would probably do is have your app maintain a list that maps document UUIDs to security-scoped bookmarks, then resolve those bookmarks at every launch to keep things up to date. This side-steps issues like what happens if the user duplicates one of your data files: the bookmark will only ever resolve to one file, so the original will remain "the file" and the duplicate will simply be "a new file" if/when the user opens it.
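A minimal macOS-only sketch of such a list (`.withSecurityScope` is not available on iOS). `DocumentRegistry` is a hypothetical type, persistence of the dictionary is omitted, and skipping bookmarks that fail to resolve is just one possible policy.

```swift
import Foundation

/// Maps each document UUID to a security-scoped bookmark for its file.
struct DocumentRegistry {
    var bookmarks: [UUID: Data] = [:]

    /// Records (or replaces) the bookmark for a document the user opened.
    mutating func register(_ url: URL, as id: UUID) throws {
        bookmarks[id] = try url.bookmarkData(options: .withSecurityScope,
                                             includingResourceValuesForKeys: nil,
                                             relativeTo: nil)
    }

    /// Resolves every bookmark, e.g. at launch. Unresolvable entries are
    /// skipped; stale ones are refreshed when a new bookmark can be made
    /// (which may require security-scoped access to have been started).
    mutating func resolveAll() -> [UUID: URL] {
        var resolved: [UUID: URL] = [:]
        for (id, data) in bookmarks {
            var isStale = false
            guard let url = try? URL(resolvingBookmarkData: data,
                                     options: .withSecurityScope,
                                     relativeTo: nil,
                                     bookmarkDataIsStale: &isStale) else { continue }
            resolved[id] = url
            if isStale, let fresh = try? url.bookmarkData(options: .withSecurityScope,
                                                          includingResourceValuesForKeys: nil,
                                                          relativeTo: nil) {
                bookmarks[id] = fresh
            }
        }
        return resolved
    }
}
```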

Note that this is also a situation where your interface may end up pushing lower-level choices, as you'd have to do most of this just to present a list of "your documents".

NOW, the one risk with "persistence" is that it's always going to be possible for the user to reorganize their data such that the files you open are not the files you think they are. For example:

  1. Start with "DocA" and "DocB"
  2. Duplicate them, creating "DocA_dup" and "DocB_dup".
  3. Delete the originals ("DocA" and "DocB")
  4. Rename "DocA_dup" -> "DocB"
  5. Rename "DocB_dup" -> "DocA"

...and now your bookmark for "DocA"/"DocB" resolves to the contents of "DocB"/"DocA". You can detect this kind of thing by embedding the document UUID into the document format; however, it definitely does start dragging in more edge cases. For example, file duplication means that you may need to modify the UUID of a document (because an existing document already has that UUID), but you may not be able to do that (because the document is read-only). That sounds like a weird edge case, but that's exactly what will happen in a case like:

  • Alice sends Bob "DocA"
  • Bob modifies his copy of "DocA"
  • Bob shares "DocA" back to Alice read-only.

So... I think this ends up being a judgement call between the benefits of persistent indexing and the annoyance of thinking through and predicting edge cases.

When a document is opened, it deletes all of its searchable items using the domainIdentifier to scope the operation. It then adds all of its searchableItems to the index.

I'd actually try doing this at close, as well as open. It's easier to "hide" any delay at close (your app can just close the UI immediately, so the delete operation isn't visible) and it minimizes any risk of oddity/disruption. You'll also do it at open (just in case something like a crash prevented the close delete), but this means that the open delete would normally be a very fast "found nothing" check.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
