Update: Refining the Architecture with LLMs (Gloss-to-Text)
I've been refining the concept to make development faster and less data-dependent. Instead of trying to solve "Continuous Sign Language Recognition" purely through computer vision, which is extremely hard and needs large sentence-level datasets, we can split the workload.
The Hybrid Pipeline Proposal:
Vision Layer (ARKit): Focus strictly on Isolated Sign Recognition.
The CoreML model only needs to identify individual signs (glosses) from the skeleton data, treating each gesture as a token (a rough sketch of this layer follows the pipeline outline below).
Input: Skeleton movement.
Output: Raw tokens like [I], [WANT], [WATER], [PLEASE].
Logic Layer (LLM):
We feed these raw tokens into an on-device LLM (or an API). Since LLMs excel at context and syntax, the model reconstructs a grammatical sentence from the tokens (a second sketch after the outline shows this handoff).
Input: [I] [WANT] [WATER] [PLEASE]
Output: "I would like some water, please."
Why this is faster to build:
We don't need a dataset of millions of complex sentences to train the vision model; we only need a dictionary of isolated signs. The grammar work is offloaded to the LLM, which already handles that kind of sentence reconstruction well. This drastically lowers the barrier to a functional prototype.