Behind The Scenes - Generative Realities
Step 01
In Unreal Engine, I am using VaRest Blueprint nodes for API communication. I have a blank USoundWave variable that is set and overwritten each time the user confirms their audio input. I have created a custom C++ plugin to enable runtime USoundWave to Byte Array conversion. The Byte Array is then converted to a Base64 string so it can be sent inside a JSON object, without saving, storing or handling local audio files. I am using OpenAI's Whisper API to transcribe the user's voice input into a string. However, Whisper doesn't accept Base64 directly, so I send the converted Base64 from Unreal to a serverless AWS Lambda Function, which temporarily re-converts it to WAV in the cloud. I set up an API Gateway in front of my Lambda Function so Unreal can communicate with it. The temporary WAV is sent to Whisper and the transcription response is retrieved back in Unreal as a string.
Custom C++ Plugin for converting USoundWave to Byte Array at runtime
Blueprint workflow for taking audio recording & converting to Base64 string
Blueprint workflow for sending Base64 string to custom Lambda Function
Lambda Function to convert Base64 to temporary WAV file and send to OpenAI Whisper API. Transcription Response sent back as the response to the initial Unreal Request.
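For anyone who prefers reading code to Blueprint graphs, here is a minimal Python sketch of the kind of Lambda handler described above. The JSON field names, sample rate and channel settings are my own assumptions for illustration; the real function may handle the WAV conversion differently.

import base64
import json
import wave

from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the Lambda environment

def lambda_handler(event, context):
    # API Gateway hands the Unreal request body over as a JSON string.
    body = json.loads(event["body"])

    # Decode the Base64 payload sent from Unreal back into raw audio bytes.
    pcm_bytes = base64.b64decode(body["audio_base64"])

    # Wrap the bytes in a temporary WAV container so Whisper will accept them.
    # Sample rate / channels / sample width here are illustrative assumptions.
    tmp_path = "/tmp/input.wav"  # /tmp is the only writable path in Lambda
    with wave.open(tmp_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(pcm_bytes)

    # Send the temporary WAV to Whisper and capture the transcription.
    with open(tmp_path, "rb") as f:
        transcription = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Return the transcription text as the response to the original Unreal request.
    return {
        "statusCode": 200,
        "body": json.dumps({"transcription": transcription.text}),
    }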
Step 02
Once the Transcription has been received back in Unreal Engine and stored as a string variable, it is sent to OpenAI's Assistants API. The Assistants API is where I instruct the AI to format its response in a way that I can easily parse back in Unreal, which enables me to send separate requests to ElevenLabs, BlockadeLabs, Vision etc. all from one single response. I am using the latest GPT-4o model, which has an incredibly fast response time. The structure of each Assistants API response is:
1. An initial response to the Transcription (which includes an instruction to end on a question, prompting the user to naturally further the adventure).
2. A key word/phrase that triggers a BlockadeLabs request for a new environment.
3. A backend "prompt" that is used to generate both the new visual and the sound effect that will be associated with it.
4. An integer that represents the style in which the visual will be generated.
Blueprint workflow for Sending the Message to an Assistants API Thread
Blueprint workflow for Initialising a Run on the Assistants API Thread
Blueprint workflow for retrieving the latest response to the Assistants API Thread
System Prompt to Assistants API for the response structure that can then be parsed
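The same message, run and retrieve loop can be sketched in Python against the OpenAI Assistants API. The ASSISTANT_ID and THREAD_ID values and the polling interval are placeholders; the real logic lives in the Blueprint workflows shown above.

import os
import time

from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = os.environ["ASSISTANT_ID"]  # placeholder: your Assistant's ID
THREAD_ID = os.environ["THREAD_ID"]        # placeholder: the conversation thread

def get_assistant_reply(transcription: str) -> str:
    # 1. Add the user's transcription as a new message on the thread.
    client.beta.threads.messages.create(
        thread_id=THREAD_ID, role="user", content=transcription
    )

    # 2. Initialise a Run on the thread so the Assistant processes the message.
    run = client.beta.threads.runs.create(
        thread_id=THREAD_ID, assistant_id=ASSISTANT_ID
    )

    # 3. Poll until the Run has finished.
    while run.status not in ("completed", "failed", "cancelled", "expired"):
        time.sleep(0.5)
        run = client.beta.threads.runs.retrieve(thread_id=THREAD_ID, run_id=run.id)

    # 4. Retrieve the latest message on the thread, i.e. the structured response
    #    that Unreal then parses into voice / visual / SFX requests.
    messages = client.beta.threads.messages.list(thread_id=THREAD_ID)
    return messages.data[0].content[0].text.value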
Step 03
Once the Response from the Assistants API has been received, it is parsed using various Split/Contains/Replace/Append nodes in Unreal Engine. The section of the Response that replies to the user's initial Transcription is sent to ElevenLabs' Voice API for an authentic audio version of the Response. The section of the Response that forms the "prompt" for the visuals & sound effects is sent to the respective BlockadeLabs and ElevenLabs SFX functions. Both the ElevenLabs Voice API and the ElevenLabs SFX API require further Lambda Functions to handle the serverless audio exchanges. My custom C++ runtime USoundWave to Byte Array function also works in the opposite direction, so the responses from ElevenLabs are converted to Base64 strings inside my Lambdas, received as JSON back in Unreal, and re-converted there. Whilst the Voice API supports multiple file formats, the SFX API currently only seems to support MP3, which isn't natively compatible with Unreal Engine, so I've had to add an extra step: using Docker to build ffmpeg on Amazon Linux and attach it as a layer in my Lambda Function.
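As a rough Python sketch of the voice Lambda described above: the endpoint path and xi-api-key header come from ElevenLabs' public API, while the JSON field names exchanged with Unreal, the voice ID and the default output format are assumptions (the real function also selects a specific audio format).

import base64
import json
import os

import requests

ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]
VOICE_ID = os.environ["VOICE_ID"]  # placeholder: the narrator voice to use

def lambda_handler(event, context):
    body = json.loads(event["body"])
    text = body["text"]  # the parsed narrative section of the Assistants response

    # Call the ElevenLabs text-to-speech endpoint for this voice.
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_API_KEY, "Content-Type": "application/json"},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=60,
    )
    resp.raise_for_status()

    # Encode the returned audio bytes as Base64 so they can travel back to
    # Unreal inside a JSON response, mirroring the Whisper Lambda in Step 01.
    audio_b64 = base64.b64encode(resp.content).decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"audio_base64": audio_b64})}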
Lambda Function to send Assistants Response from Unreal to ElevenLabs and convert the WAV response to Base64
Unreal receives the Base64, which is converted back into a USoundWave via my custom Byte Array to Sound Wave C++ node
Byte Array to USoundWave C++
Lambda Function for sending request to ElevenLabs SFX API, receiving mp3 response, converting to WAV Base64 with ffmpeg layer
Compiling ffmpeg in Amazon Linux Docker
Receiving Base64 SFX audio in Unreal and converting it to a Byte Array, then to a USoundWave.
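The MP3-to-WAV step inside the SFX Lambda might look something like the Python sketch below, assuming the layer exposes the binary at /opt/bin/ffmpeg (layers are extracted under /opt) and that the sound-generation endpoint and field names match ElevenLabs' current docs; treat both as assumptions to verify.

import base64
import json
import os
import subprocess

import requests

ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]
FFMPEG = "/opt/bin/ffmpeg"  # provided by the Lambda layer built in Docker

def lambda_handler(event, context):
    prompt = json.loads(event["body"])["prompt"]  # backend SFX prompt from the Assistant

    # Request a sound effect; the SFX endpoint currently returns MP3 audio.
    resp = requests.post(
        "https://api.elevenlabs.io/v1/sound-generation",
        headers={"xi-api-key": ELEVENLABS_API_KEY, "Content-Type": "application/json"},
        json={"text": prompt},
        timeout=60,
    )
    resp.raise_for_status()

    mp3_path, wav_path = "/tmp/sfx.mp3", "/tmp/sfx.wav"
    with open(mp3_path, "wb") as f:
        f.write(resp.content)

    # Convert MP3 to WAV with the ffmpeg binary from the attached layer,
    # since Unreal can't import MP3 natively at runtime.
    subprocess.run([FFMPEG, "-y", "-i", mp3_path, wav_path], check=True)

    with open(wav_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    return {"statusCode": 200, "body": json.dumps({"audio_base64": audio_b64})}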
Step 04
Whilst Step 3 showcased the audio-related steps, Step 4 focuses on the visual aspect. From the initial Assistants API response, a "prompt" is parsed and sent to BlockadeLabs, along with a Style ID. These are added to the BlockadeLabs API request body, and the API responds with a file_url, which is then loaded at runtime as an Image Texture inside a Material. At the same time, the file_url is sent to a separate OpenAI Chat Completion API request, using GPT-4o's "Vision" capabilities to comprehensively describe the elements of the file_url image. That description is then appended to any future requests to the Assistants API - that way, the Assistants API responses can include details of your environment for added depth and authenticity.
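A rough Python equivalent of the BlockadeLabs POST-then-GET exchange follows. The endpoint paths, header name and response fields are based on my reading of the public Skybox API and may have changed, so treat them as assumptions rather than the exact calls used here.

import os
import time

import requests

BLOCKADE_API_KEY = os.environ["BLOCKADE_API_KEY"]
BASE = "https://backend.blockadelabs.com/api/v1"  # verify against current BlockadeLabs docs

def generate_skybox(prompt: str, style_id: int) -> str:
    # POST the parsed prompt and Style ID to start a 360 skybox generation.
    resp = requests.post(
        f"{BASE}/skybox",
        headers={"x-api-key": BLOCKADE_API_KEY},
        json={"prompt": prompt, "skybox_style_id": style_id},
        timeout=30,
    )
    resp.raise_for_status()
    request_id = resp.json()["id"]

    # GET the request status until the image is ready, then return its file_url.
    while True:
        status = requests.get(
            f"{BASE}/imagine/requests/{request_id}",
            headers={"x-api-key": BLOCKADE_API_KEY},
            timeout=30,
        ).json()["request"]
        if status["status"] == "complete":
            return status["file_url"]
        time.sleep(2)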
Blueprint workflow for sending POST Request to BlockadeLabs for 360 visuals
Blueprint workflow for sending GET Request to BlockadeLabs for 360 visuals
Blueprint workflow for loading file_url image into Material at runtime
Material with “Blockade” 2D Texture Parameter which is updated at runtime with the latest visual response
Blueprint workflow for sending file_url to Vision (through Lambda for easier JSON formatting)
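The Vision step can be sketched as a single Chat Completions request in Python; the prompt wording here is illustrative, and the returned description is what gets appended to future Assistants API requests.

from openai import OpenAI

client = OpenAI()

def describe_environment(file_url: str) -> str:
    # Ask GPT-4o to describe the generated 360 image so the description can be
    # fed back into later Assistants API requests for added depth.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Comprehensively describe the elements of this environment."},
                    {"type": "image_url", "image_url": {"url": file_url}},
                ],
            }
        ],
    )
    return response.choices[0].message.content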
Step 05
Steps 1-4 are repeated and all relevant variables are replaced at runtime, enabling an infinite reality generator!
Check out the below video for an uncut example from start to finish of the above workflow in action inside a Virtual Reality Controller-less experience!
Story Narration!
For "Story Mode", the user does not have any input this time, however the same base workflow applies for the Assistant to generate a brand-new story, for ElevenLabs to provide a captivating voiceover and for BlockadeLabs to provide stunning visuals. A key difference this time however, in order to prevent latency gaps between responses and in order to provide a large story in a cost effective measure, several smaller requests are sent and stored programmatically - this means that from the minute the AI starts talking, there are NO gaps until it has concluded it's story, which can be up to 10 minutes long!
These smaller requests are automated and consist of either “Start”, “Continue” or “End”, which action different functions via the Assistant Responses.
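The buffering idea behind the gap-free narration can be illustrated with a short Python sketch (the real implementation lives in Blueprints): the Start/Continue/End commands follow the description above, everything else here is a placeholder.

import queue
import threading

# Illustrative pacing: one "Start", several "Continue"s, then an "End".
SEGMENT_COMMANDS = ["Start"] + ["Continue"] * 8 + ["End"]
segment_buffer = queue.Queue()

def request_segment(command: str) -> bytes:
    """Stub for the Assistants -> ElevenLabs round trip that yields one audio chunk."""
    return b""  # placeholder audio bytes

def prefetch_story():
    # Request each small segment in order and queue it as soon as it is ready,
    # so the next chunk is always buffered before the current one finishes playing.
    for command in SEGMENT_COMMANDS:
        segment_buffer.put(request_segment(command))
    segment_buffer.put(None)  # sentinel: the story has concluded

# Generation runs in the background while playback drains the queue back-to-back,
# which is what removes the latency gaps between responses.
threading.Thread(target=prefetch_story, daemon=True).start()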
Check out an uncut example below to see what it’s capable of!
If you have any questions about any of the above processes, please drop me an email at: harryjamesfox@hotmail.co.uk, or contact me on LinkedIn :) I would be delighted to chat.