Fine-Tuning Stable Diffusion XL to Output a Specific Subject
How to fine-tune a text-to-image model for under 100 rupees
I always saw tweets of people mentioning how open-source LLMs are developing at a crazy pace. I never believed it until today. I started looking at Llama first (an open-source foundation model developed by Meta), rabbit-holed into GPUs from there, and finally reached HuggingFace and Replicate.
Replicate is something like Replit, but for testing and running machine-learning models; that's how I framed it a minute into the website. And it proved to be genuinely easy to get started by reusing existing models the community has made and uploaded.
Replicate has a blog post showing how you can fine-tune a foundation model like SDXL on as few as 6 of your own images and then prompt a whole album of new images of that subject. The images can be of an object, a pet, or yourself. Do we even need personal photographers now?
I decided to try this out: train on a subject and see how good the results get. I chose the anime character Sypha Belnades from Castlevania.
Before training, I tried the following prompts on the foundation model to check whether our subject was already known to it:
The prompt for the image below was, “Sypha Belnades and Trevor Belmont dancing in front of Dracula’s Castle, sunny day“. Look at the faces and the attire: the foundation model is clearly clueless about who these characters are.
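For reference, this baseline check can also be run through Replicate’s Python client instead of the web UI. A minimal sketch, assuming you have the replicate package installed, REPLICATE_API_TOKEN set in your environment, and a real SDXL version hash filled in where the placeholder is:

import replicate  # pip install replicate; it reads REPLICATE_API_TOKEN from the environment

# Prompt the stock SDXL model (no fine-tuning yet) to see what it already knows about the subject.
output = replicate.run(
    "stability-ai/sdxl:<version-id>",  # placeholder: pin whichever public SDXL version you use
    input={"prompt": "Sypha Belnades and Trevor Belmont dancing in front of Dracula's Castle, sunny day"},
)
print(output)  # a list of generated image URLs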
Fine Tuning
Without the blog and YouTube video from @fofrai, I would never have been able to do this. But with them, it's just five simple steps:
Create a Replicate account. Set up billing. Generate an API token.
Collect the training images into a folder, convert it into a zip file, and upload the zip to a public repo (a minimal zipping sketch is below).
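Creating the zip itself is simple. Here is a minimal sketch in Python; the folder name training_images/ is my own placeholder, not anything Replicate requires:

import zipfile
from pathlib import Path

# Pack every image from a local folder (placeholder name) flat into data.zip.
with zipfile.ZipFile("data.zip", "w") as zf:
    for img in sorted(Path("training_images").iterdir()):
        if img.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            zf.write(img, arcname=img.name)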
The first bug showed up at this step: Replicate kept telling me that the zip file I had created was not a valid zip. Or the issue may have been with hosting the zip on GitHub and how Replicate retrieved it.
Hence I had to upload the zip using the following API suggested in the blog (I did not want to use it because I was unsure where those images end up being hosted, but I had no other option):
# Request a pre-signed upload URL from Replicate's experimental Dreambooth API ($INPUT_TOKEN is your API token)
RESPONSE=$(curl -s -X POST -H "Authorization: Token $INPUT_TOKEN" https://dreambooth-api-experimental.replicate.com/v1/upload/data.zip)
# PUT the local zip to the returned upload_url, then extract the serving_url (the hosted link to your zip)
curl -X PUT -H "Content-Type: application/zip" --upload-file INPUT_FILE.zip "$(grep -o "\"upload_url\": \"[^\"]*" <<< $RESPONSE | awk -F'"' '{print $4}')"
SERVING_URL=$(grep -o "\"serving_url\": \"[^\"]*" <<< $RESPONSE | awk -F'"' '{print $4}')
echo $SERVING_URL
We receive an output link like https://replicate.delivery/pbxt/…yadayadayada…/data.zip; keep this serving URL handy for the training step.
Create a model in Replicate.
Set the type to SDXL and choose an Nvidia A40 (Large) GPU.
Run the pre-written code in Google Colab (a sketch of the underlying call is below). The model gets fine-tuned in about 10 minutes.
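Under the hood, that Colab code boils down to one training call against Replicate's SDXL trainer. A minimal sketch via the Python client, where the trainer version hash, the destination model name, and the truncated zip URL are placeholders to fill in with your own values:

import replicate

# Kick off a fine-tuning job on the SDXL trainer.
training = replicate.trainings.create(
    version="stability-ai/sdxl:<version-id>",  # placeholder: the SDXL trainer version
    input={
        "input_images": "https://replicate.delivery/pbxt/.../data.zip",  # the serving URL from the upload step
    },
    destination="<your-username>/sypha-sdxl",  # the model you created on Replicate earlier
)
print(training.status)  # poll this (or watch the web UI) until training finishes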
Start prompting via the API (sketch below) or on Replicate's UI to get images.
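Prompting the fine-tuned model from code looks almost the same as the baseline test, just pointed at the new model. Another sketch with placeholders; the subject is referenced in the prompt by the trainer's trigger token (the default is "TOK" as far as I can tell, so double-check your training settings):

import replicate

# Generate images from the fine-tuned model; the trigger token stands in for the trained subject.
output = replicate.run(
    "<your-username>/sypha-sdxl:<version-id>",  # placeholder: your fine-tuned model and its version
    input={"prompt": "A photo of TOK dancing in front of Dracula's Castle, sunny day"},
)
print(output)  # a list of generated image URLs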
Costing
An Nvidia A40 (Large) costs about $0.043 per minute of compute. Fine-tuning on the 6 images I had zipped took around 10 minutes, and running at least 7-8 prompts afterwards added a few more minutes of GPU time. The total still comes in below 100 rupees, and the output is just mind-blowing. A rough breakdown:
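A quick back-of-the-envelope check of that claim, with the rupee conversion rate being my own assumption rather than anything Replicate charges in:

# Rough cost estimate for the whole experiment.
price_per_min_usd = 0.043   # Nvidia A40 (Large) on Replicate
training_minutes = 10       # fine-tuning on the 6 zipped images
inference_minutes = 5       # ballpark for 7-8 prompts afterwards
usd_to_inr = 83             # assumed exchange rate, not from Replicate

total_usd = price_per_min_usd * (training_minutes + inference_minutes)
print(f"~${total_usd:.2f}, roughly ₹{total_usd * usd_to_inr:.0f}")  # about $0.65, i.e. around ₹54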
Output





Now, go watch the Castlevania trailer and compare this output with the original. What do you folks think?
Implications
This is mind-blowing because so many applications can potentially get disrupted by this evolving tech. Imagine humans making a character once, and then AI animating or replicating it in any number of situations. (We could also test how accurately this might work with live action.)
An AI can potentially start making movies on its own (there's already an open-source video model on HuggingFace), or a single individual, with the help of such AI models, could create an entire film. Storytelling will get democratised beyond our imagination.
But if this is the scenario, will we as humans stay creative enough in the upcoming centuries? Or will we be outsourcing creativity entirely?