I took a quick look at the code you posted earlier in the thread. Are you still using HAMP? It's my understanding there's an unavoidable delay between issuing a command that way and it taking effect on the device. If you are, have you tried HSP with the v3 API? It lets you buffer actions in the device's memory to be executed at specified times.
There’s a demo at https://gitlab.com/sweettechas/platform/platform-api-examples/-/tree/main/handy-rest-api-v3
though it's got a bad URL: you need to edit handy-rest-api-v3\sse\device-status\device-status.js and change 'Handyverse' v3-1 on line 11 to 'Handyverse' v3
Then you can follow the main README to launch the demo, go to HSP, then Patterns, and observe on your Handy how it can play once, or loop, complex patterns built from points. (You must create an application ID here for this version of the API; if you don't enter it first, the demo page will get confused, and you may have to clear its cookies once you have one.)
This is the API I’m planning to use in my own stuff on my next go at it.
My current approach might be helpful to you.
After a few runs, I decided that in “Stroke” mode, when I send a “signal” as a user, the LLM’s response should not be generated first.
If I tell the LLM “Hey, I’m almost there” or “I’m coming now,” the LLM can’t take 15 seconds to generate a response if it’s going to have any effect.
For this reason, I use a pool of pre-generated candidate responses. These are regenerated in the background immediately after they are used. With a response cache, you can also reuse them across multiple sessions.
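For anyone wanting to try this, here's a minimal sketch of the pool idea, assuming a slow blocking LLM call; all names are illustrative, and `call_llm` is a stand-in for the real request:

```python
import threading
import queue

POOL_SIZE = 3  # pre-generated replies kept ready per signal

def call_llm(signal):
    # stand-in for the real (slow) LLM call
    return f"canned reply for {signal}"

class ResponsePool:
    def __init__(self, signals):
        # one small queue of ready-made replies per signal type
        self.pools = {s: queue.Queue() for s in signals}
        for s in signals:
            for _ in range(POOL_SIZE):
                self.pools[s].put(call_llm(s))

    def get(self, signal):
        reply = self.pools[signal].get_nowait()  # instant, no LLM wait
        # regenerate a replacement in the background for next time
        threading.Thread(
            target=lambda: self.pools[signal].put(call_llm(signal)),
            daemon=True,
        ).start()
        return reply

pool = ResponsePool(["almost_there", "coming_now"])
reply = pool.get("coming_now")  # returns immediately
```

Persisting the queues to disk between runs would give you the cross-session cache mentioned above.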
I’ve also come up with something for the patterns. Instead of having the LLM generate an entire pattern, I simply use my current level + limits/speed from the LLM. My pattern service then generates a script (approx. 25 to 45 seconds long) based on these values and predefined segments.
A segment can be a simple up and down movement, for example.
Depending on the level, I then simply add acceleration/deceleration + cyclic pauses. This gives me enough flexibility.
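A rough sketch of such a pattern service, assuming a (time_ms, position_percent) point format; the segment shape, scaling formulas, and all numbers are illustrative, not the actual code:

```python
def up_down_segment(t_ms, depth_min, depth_max, stroke_ms):
    # one simple up-and-down movement as (time_ms, position_percent) points
    return [(t_ms, depth_min),
            (t_ms + stroke_ms // 2, depth_max),
            (t_ms + stroke_ms, depth_min)]

def build_script(level, depth_min, depth_max, max_speed, target_ms=30_000):
    # higher level -> shorter strokes (faster), shorter cyclic pauses
    stroke_ms = int(2000 - 15 * level * (max_speed / 100))
    pause_ms = max(0, 1500 - 120 * level)
    script, t = [], 0
    while t < target_ms:
        script += up_down_segment(t, depth_min, depth_max, stroke_ms)
        t += stroke_ms + pause_ms
    return script

# a ~30 s script at level 5, limited to depths 10-90 and 80% speed
points = build_script(level=5, depth_min=10, depth_max=90, max_speed=80)
```

Adding acceleration/deceleration would just mean varying `stroke_ms` across the loop instead of keeping it constant.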
For the LLM to create “good” scripts, it would first need to understand what it is generating, and I don’t see how this is possible without specific training for this purpose.
Ultimately, it depends on the user’s experience and perception. Not everything has to be 100% generated by the LLM, as long as it feels that way.
If you are awaiting a complete response each time (which I assume from the 15-second figure), you might find this useful. In my last implementation, I processed the response from the LLM as a stream, token by token, just as a typical chat frontend does. I concatenated the tokens onto a string until a delimiter appeared in it (in my case "RESPONSE: "), then parsed out any hardware commands (e.g. setting speed or depth). Only the tokens after that delimiter were displayed. This approach lets the LLM emit quick non-dialogue information first, lets the program grab and act on it almost instantly (<1 s), and then displays any text afterward as it comes in. It should work even with JSON outputs if you use a JSON decoder that can handle incomplete JSON (or just substring out the bits you want).
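To make the streaming split concrete, here's a hedged sketch: `token_stream` is any iterable of text chunks from the LLM, and the `SPEED=40 DEPTH=70 RESPONSE: ...` command format is illustrative, not the actual protocol:

```python
DELIM = "RESPONSE: "

def split_stream(token_stream, on_command, on_text):
    buf = ""
    seen_delim = False
    for tok in token_stream:
        if seen_delim:
            on_text(tok)  # display tokens as they arrive
            continue
        buf += tok
        if DELIM in buf:
            head, tail = buf.split(DELIM, 1)
            for part in head.split():           # e.g. "SPEED=40"
                key, _, val = part.partition("=")
                if val:
                    on_command(key, int(val))   # fires within the first few tokens
            if tail:
                on_text(tail)
            seen_delim = True

# simulated token stream with the delimiter split across chunks
cmds, text = [], []
split_stream(["SPE", "ED=40 DEP", "TH=70 RESP", "ONSE: hi", " there"],
             lambda k, v: cmds.append((k, v)), text.append)
```

Note that buffering until the delimiter is what makes this robust to the delimiter itself being split across token boundaries.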
The idea of a cache is interesting, but if I can avoid it I wouldn’t want to do it because it means the responses are not really taking into account what the user just said, you know? I can see why you might do that if the alternative is a massive delay though.
(Edit: also, if you're having huge delays, make sure you're only changing the end of the context. Don't alter the system prompt or other messages too far back in the history; instead, invisibly prepend any state info to the beginning of the user's current message, for example, without retaining it in older messages. This allows the inference server to avoid recalculating as much of the KV cache.)
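A minimal sketch of that cache-friendly prompt assembly, assuming an OpenAI-style message list; the state format is made up for illustration:

```python
def build_messages(system_prompt, history, user_msg, state):
    # volatile state goes only on the NEWEST user turn; the system prompt
    # and old messages are passed through verbatim, so the server's cached
    # prefix stays valid
    state_line = f"[state: speed={state['speed']} depth={state['depth']}]\n"
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": state_line + user_msg}]
    )

history = [{"role": "user", "content": "hi"},
           {"role": "assistant", "content": "hello"}]
msgs = build_messages("You are...", history, "faster please",
                      {"speed": 40, "depth": 70})
# the stored history keeps only the bare message for future turns
history.append({"role": "user", "content": "faster please"})
```

The key point is that `history` never contains the state line, so each turn only the final message differs from what the server has already processed.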
A heads-up: I had this reply translated into English, because I wanted to share another perspective with you in the hope that it might help with your builds. I've been following the delay discussion. I have a similar setup and don't see these delays. Here's what I do:
If you wait for the LLM response to generate movements, you’ll have a delay. The fix is to generate movements from the user’s text, not from the LLM response.
How my system works
LLM generates hidden commands in JSON
The LLM responds in JSON: {"chat": "text to show", "move": {"sp": 50, "dp": 50, "rng": 50}}
The “move” field contains commands but isn’t shown in chat
I extract only “chat” for display
As a fallback, I also extract a “move” semantically from the LLM’s text if no explicit “move” is in the JSON
Movements are generated from user text, not the LLM response
When the user writes, I parse their text to extract cues (e.g., “fast”, “slow”, “I’m coming”)
I use simple keyword matching (substring checks, not regex)
I generate movements immediately from these cues + current phase using predefined envelopes
Movements are generated dynamically with Gaussian distribution for natural variation
The LLM’s text is analyzed semantically to extract a “move” as fallback, but this is ignored for main movements
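The cue-matching plus envelope idea above can be sketched as follows; the cue table, phase envelopes, and numbers are all illustrative, and matching is plain substring checks as described, not regex:

```python
import random

# illustrative cue -> speed adjustment table
CUES = {"faster": 30, "slow": -25, "about to come": 40}

# illustrative base speed per phase envelope
PHASE_ENVELOPES = {"WARM-UP": 30, "ACTIVE": 60, "RECOVERY": 20}

def speed_from_user_text(text, phase, rng=random):
    base = PHASE_ENVELOPES[phase]
    low = text.lower()
    for cue, delta in CUES.items():
        if cue in low:               # simple substring match, no regex
            base += delta
    # Gaussian jitter for natural variation, then clamp to valid range
    return max(0, min(100, base + rng.gauss(0, 5)))

speed = speed_from_user_text("I'm about to come, faster!", "ACTIVE")
```

Running this on the user's text the moment it arrives is what lets movement start without waiting for anything.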
Execution flow
I call the LLM first (takes ~5 seconds)
While waiting, I already have the user’s text
After the LLM responds, I generate movements from the user’s text (not from the LLM response)
Movements start immediately after generation (<1 second)
Both the LLM’s explicit “move” field and the semantically extracted “move” from LLM text are completely ignored
Practical example
User writes: “I’m about to come, faster!”
My system:
Calls LLM (takes ~5 seconds)
After LLM responds, reads user text "I'm about to come" → triggers fast movement
Reads “faster” → increases speed
Generates movements from user text + phase envelopes
Movement starts immediately (<1 second after LLM response)
The LLM’s “move” (whether explicit or semantically extracted) is ignored
Total delay: ~5 seconds (LLM response time), not 15+ seconds.
Key difference
Your current approach: wait for LLM → extract “move” from LLM response → execute → delay
My approach: wait for LLM → generate movements from user text (not LLM response) → execute immediately → minimal delay
Don’t use the LLM’s “move” field or semantically extracted moves from LLM text. Generate movements from the user’s text instead:
Parse user text for keywords/cues (simple substring matching)
Use predefined phase envelopes (WARM-UP, ACTIVE, RECOVERY, etc.)
Generate movements dynamically based on phase + cues
Execute immediately after LLM responds
The LLM is only for chat responses, not for movements. Even if you wait for the LLM, movements can start immediately because they’re generated from the user text you already have.
Note on the “RESPONSE:” delimiter
If you use JSON, you don’t need a delimiter: “chat” and “move” are already separated. Just extract “chat” for display. But more importantly, ignore the “move” field entirely and generate movements from user text instead.
Hope this helps.
P.S. As for response latency, one of the first changes I made compared to the original version was hooking the language model up to LM Studio, and I’ve never had any delay issues since.
For completeness, these are the models I tested: Meta Llama 3.1 12B, Qwen 2.5 7B Instruct Uncensored, Nous Hermes 2 Mistral 7B, and Hermes Llama 3.1 12B.
I know if I was using JSON I wouldn’t need a delimiter. I chose not to use JSON, because it was unnecessary. I only mentioned JSON at all on the assumption that people with a derivative of StrokeGPT would be using JSON. I consider JSON a waste of tokens here, at least with a modern model.
As I understand it, you’re proposing not using the AI to decide what to do, but still only executing the decision when the AI starts generating a chat response. What I proposed was having the AI state what to do as the first few tokens, and then the chat response afterward. These should be effectively the same delay - they both only act after the LLM starts responding - except I would still benefit from the AI’s natural language processing to decide the “move.”
Either way, I don’t experience any meaningful delays myself. Any commands come first in the response, so the only delay is prompt processing / prefill, and that’s essentially nothing. My time-to-first-token from llama-server on a GTX 1080 Ti is like 0.5 seconds, and any commands before the mentioned delimiter are just the first few tokens. So in my case, there’s no need to compromise and do non-AI analysis of text input to determine things like that. I’ve been doing this with Qwen 3 8B heretic, with thinking disabled. That’s probably important given the significant advances in LLMs this year, if you’re working with models of this size from like a year ago they’re going to be dramatically less flexible and that changes what strategies are viable.
I’m having the same problem and haven’t been able to figure out what you meant by needing to /start. Would you mind explaining how you got it working with that?
Well… I work on it when I have free time, but it's slow progress because I'm neither a coder nor an artist.
I had to re-iterate on a lot of ideas too… and I kept swapping visual assets for more appropriate ones, because I abandoned the idea of everything happening at one location.
While I understand your enthusiasm, it's best to temper expectations for now.
Perhaps check once a month if you feel like it; if I ever finish a working prototype, people can download it from this site anyway.
Gotta add C:\Users\Your_Username\AppData\Local\Programs\Python\PythonXX\Scripts to your PATH, where PythonXX is whatever Python version you have installed (e.g. Python314), and then reopen your terminal.
I know it was mentioned back in December as a :shrug: sort of feature, but I figured I'd add my support: it would be fantastic to have this integrate with T-code devices (SR6, OSR2), even with just one axis of control.
I created a StrokeGPT on steroids. Fixed many problems and introduced a shitload of new features: an integrated video player, TTS support for Piper using enhanced local TTS models (free), and integrated native browser TTS support (in case Piper is too heavy on your resources).
Enhanced the AI experience by introducing a new LLM for use in Ollama, fine-tuned for 'The Handy', with the option to change LLM models. Added verbosity levels (short, normal, and story) to enhance conversations.
Added additional persona profiles (and ability to save them)
Added warm-up and ramp-up/ramp-down modes.
Calibration of tip and base positions on startup of the application; range problems are gone.
Quick commands bar for use with manual or auto mode AI chat.
Added a manual control interface, plus presets and pattern sequences.
It’s basically a complete new application. Did a complete overhaul of this awesome project.
I call it ‘Handy Station’ because it’s more like a complete control center for ‘The Handy’.