Create ML fails to train a text classifier using the BERT transfer learning algorithm

I'm trying to train a text classifier model in Create ML. The Create ML app/framework offers five algorithms. I can successfully train the model with all of the algorithms except the BERT transfer learning option. When I select this algorithm, Create ML simply stops the training process immediately after the initial feature extraction phase (with no reported error).

What I've tried:

  1. I tried simplifying the dataset to just a few classes and short examples in case there was a problem with the data.

  2. I tried experimenting with the number of iterations and language/script options.

  3. I checked Console.app for logged errors and found the following for the Create ML app:

error	10:38:28.385778+0000	Create ML	Couldn't read event column - category is invalid.  Format string is : <private>
error	10:38:30.902724+0000	Create ML	Could not encode the entity <private>. Error: <private>

I'm not sure whether these errors are normal or indicative of a problem. I don't know what the "event" column refers to – there's no event column in my data, and I don't believe there should be one. These errors are not reported when using the other algorithms.

Given that I couldn't get the app to work with BERT, I switched over to the CreateML framework and followed the code samples given in the documentation. (By the way, there's an error in the docs: the line let (trainingData, testingData) = data.stratifiedSplit(on: "text", by: 0.8) should be stratifying on "label", not on "text").
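
For reference, the corrected line reads:

let (trainingData, testingData) = data.stratifiedSplit(on: "label", by: 0.8)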

The main chunk of code looks like this:

import CreateML

// Configure BERT transfer learning. Language is set explicitly here because
// the API doesn't appear to offer an "Automatic" option.
var parameters = MLTextClassifier.ModelParameters(
    validation: .split(strategy: .automatic),
    algorithm: .transferLearning(.bertEmbedding, revision: 1),
    language: .english
)
parameters.maxIterations = 100

// Train on a table with "text" and "label" columns.
let sentimentClassifier = try MLTextClassifier(
    trainingData: trainingData,
    textColumn: "text",
    labelColumn: "label",
    parameters: parameters
)
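
After training completes, I evaluate and export along these lines (a sketch – testingData comes from the corrected split above, and the output path is just an example):

let metrics = sentimentClassifier.evaluation(
    on: testingData,
    textColumn: "text",
    labelColumn: "label"
)
print(metrics)

// Export to a Core ML model file (example path).
try sentimentClassifier.write(to: URL(fileURLWithPath: "SentimentClassifier.mlmodel"))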

Ultimately I want to train a single multilingual model, and I believe that BERT is the best choice for this. The problem is that there doesn't seem to be a way to choose the multilingual Latin script option in the API. In the Create ML app you can theoretically do this by selecting the Latin script with language set to "Automatic", as recommended in this WWDC video (relevant section starts at around 8:02). But, as far as I can tell, ModelParameters only lets you pick a specific language.

I presume the framework must provide some way to do this, since the Create ML app uses the framework under the hood, but I can't see how. Another possibility is that the Create ML app is misrepresenting the framework: perhaps selecting a specific language in the app makes no actual difference. For example, maybe all Latin-script languages use the same model under the hood, and the language selector is just there to guide people to the right choice (though this is pure speculation on my part).

Any help would be much appreciated! Ideally I'd like to keep using the Create ML app, if the BERT option can be made to work – is it actually working for anyone? Failing that, I want to use the framework to train a multilingual Latin-script model with BERT, so I'm looking for either instructions on how to select that option or confirmation that choosing .english gives the correct multilingual Latin model.

I'm running Xcode 26.2 on Tahoe 26.1 on an M1 Pro MacBook Pro. I have version 6.2 of the Create ML app.

I tried simplifying the dataset to just a few classes and short examples in case there was a problem with the data.

Can you share this sample training data you use to reproduce this issue in the CreateML app?

@tjia Thanks for looking into this. Here's some small example data you can use to test it. I don't think the data really matters – it seems to fail with whatever I put in.

Save this as a JSON file and create a new Text Classifier project with default settings and the BERT algorithm. For me, the training stops immediately after the feature extraction phase with no error other than "Training stopped".

[
	{
		"text": "Pinus contorta",
		"label": "animalsAndPlants"
	},
	{
		"text": "Rabbit",
		"label": "animalsAndPlants"
	},
	{
		"text": "Brochoadmones",
		"label": "animalsAndPlants"
	},
	{
		"text": "Zebra",
		"label": "animalsAndPlants"
	},
	{
		"text": "Oak",
		"label": "animalsAndPlants"
	},
	{
		"text": "Campanula rotundifolia",
		"label": "animalsAndPlants"
	},
	{
		"text": "Black wood pigeon",
		"label": "animalsAndPlants"
	},
	{
		"text": "Colorado potato beetle",
		"label": "animalsAndPlants"
	},
	{
		"text": "Corvidae",
		"label": "animalsAndPlants"
	},
	{
		"text": "Honey bee",
		"label": "animalsAndPlants"
	},
	{
		"text": "Pablo Picasso",
		"label": "artAndDesign"
	},
	{
		"text": "Paul Cézanne",
		"label": "artAndDesign"
	},
	{
		"text": "Marcel Duchamp",
		"label": "artAndDesign"
	},
	{
		"text": "Proto-Cubism",
		"label": "artAndDesign"
	},
	{
		"text": "Vincent van Gogh",
		"label": "artAndDesign"
	},
	{
		"text": "Cubism",
		"label": "artAndDesign"
	},
	{
		"text": "Rococo",
		"label": "artAndDesign"
	},
	{
		"text": "Art",
		"label": "artAndDesign"
	},
	{
		"text": "Interior design",
		"label": "artAndDesign"
	},
	{
		"text": "Typography",
		"label": "artAndDesign"
	}
]
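
For what it's worth, this is roughly how I load the same file when testing with the framework (assuming it's saved as data.json):

import CreateML
import Foundation

let url = URL(fileURLWithPath: "data.json")
let data = try MLDataTable(contentsOf: url)  // picks up the "text" and "label" columns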

I was not able to reproduce the failure you saw – training with BERT works with your toy data on my Mac, though I'm on different OS/Xcode versions. To clarify, your macOS is Tahoe 26.1 and Xcode is 26.2, correct?

Can you file a feedback with sysdiagnose? Meanwhile, I will try to get the exact same setup as yours to see if we can repro.

Hi @Frameworks Engineer, thanks for looking into this.

Yes, here are the relevant version numbers:

  • Tahoe Version 26.1 (25B78)
  • Xcode Version 26.2 (17C52)
  • Create ML Version 6.2 (175.7)
  • MacBook Pro (M1 Pro, 32 GB)

I did wonder if maybe I don't have the model assets on my machine, but note that I can successfully run the BERT option using the framework directly (with the above code).

Although it's a little less convenient, I could simply build the models using the framework, but to do so I was hoping to get confirmation on two points:

  • Is there a specific way to select the Latin/Automatic option to get the multilingual model that I can train with multilingual training data?
  • If not, should I simply set the language to e.g. .english to get the multilingual model? If so, does that mean all languages will be tokenized with the "wrong" tokenizer, or could there be some other problem with doing this?

Many thanks for any guidance you can provide.

Submitted feedback: FB21730591

You'd just need to set language: nil to get the behavior of latin/automatic.
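
In code, that would look something like this (same setup as the earlier snippet, with the language left unset):

var parameters = MLTextClassifier.ModelParameters(
    validation: .split(strategy: .automatic),
    algorithm: .transferLearning(.bertEmbedding, revision: 1),
    language: nil  // nil gives the Latin/Automatic behavior
)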

We tried your macOS/Xcode but still can't reproduce your error, unfortunately.

We tried your macOS/Xcode but still can't reproduce your error, unfortunately.

Okay, well thanks for trying.

You'd just need to set language: nil to get the behavior of latin/automatic.

But then how would I select the CJK/Automatic or Cyrillic/Automatic options? As far as I can tell, the CreateML framework doesn't distinguish between language and language family, so it's unclear how the Create ML app is able to make this distinction or what the distinction actually means.

From the WWDC video I linked to previously, it sounds like there is one model per language family, which would mean there's essentially no difference between choosing Latin/Automatic and Latin/English, or CJK/Automatic and CJK/Japanese.

Can you clarify what the difference is? Possible theories include:

  1. The language selector menu is just there to reassure users that they made the right choice of family, but it makes no difference under the hood.
  2. Picking a specific language changes how the data is tokenized before training.
  3. There are in fact separate BERT models trained on monolingual data, as well as models trained on multilingual data.
Accepted Answer (Frameworks Engineer)

The CreateML framework API here is not the best design we've put out, and it needs some documentation to clarify this. As long as you select BERT as the embedding type, it is inherently multilingual. The underlying implementation (beneath the CreateML framework) identifies the dominant language in your training data and then chooses the per-script embedding model. For example, if your training data's dominant language is English, the Latin-script embedding is used, whereas if the dominant language is Chinese, the CJK-script embedding is used.
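
If you want to sanity-check which script your data will map to before training, the NaturalLanguage framework can report a dominant language along these lines (a rough sketch – CreateML's internal detection may differ, and allTrainingText is a placeholder for your training texts joined into one string):

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.processString(allTrainingText)  // placeholder: your training texts concatenated
if let dominant = recognizer.dominantLanguage {
    print("Dominant language:", dominant.rawValue)  // e.g. "en" implies the Latin-script embedding
}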
