Fine-Tuning Stable Diffusion XL to Output a Specific Subject
How to fine-tune a text-to-image model for under 100 rupees
I always saw tweets of people mentioning how open-source LLMs are developing at a crazy pace. I never believed it until today. I started looking at Llama first (an open-source foundation model developed by Meta), rabbit-holed into GPUs from there, and finally reached HuggingFace and Replicate.
Replicate is something like Replit, but for testing and running machine-learning models; that's how I framed it a minute into the website. And it proved to be genuinely easy to get started by reusing existing models the community has made and uploaded.
Replicate has a blog post showing how you can fine-tune a foundation model like SDXL on as few as 6 of your own images and then prompt a whole album of new images of that subject. The images can be of an object, a pet, or yourself. Do we even need personal photographers now?
I decided to try this out: train on a subject and see how good the results get. I chose the anime character Sypha Belnades from Castlevania.
Before training, I tried the following prompts on the foundation model to check whether our subject was already known to it:
The prompt for the image below was, “Sypha Belnades and Trevor Belmont dancing in front of Dracula’s Castle, sunny day“. Look at the faces and the attire: the foundation model is clearly clueless about who these characters are.
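For reference, this baseline check can also be run through Replicate’s Python client instead of the web UI. A minimal sketch, assuming you have the replicate package installed, REPLICATE_API_TOKEN set in your environment, and a real SDXL version hash filled in where the placeholder is:

import replicate  # pip install replicate; it reads REPLICATE_API_TOKEN from the environment

# Prompt the stock SDXL model (no fine-tuning yet) to see what it already knows about the subject.
output = replicate.run(
    "stability-ai/sdxl:<version-id>",  # placeholder: pin whichever public SDXL version you use
    input={"prompt": "Sypha Belnades and Trevor Belmont dancing in front of Dracula's Castle, sunny day"},
)
print(output)  # a list of generated image URLs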
Fine Tuning
Without the blog and YouTube video from @fofrai, I would never have been able to do this. But with them, it's just five simple steps:
Create a Replicate account. Set up billing. Generate an API token.
Collect the training images into a folder, convert it into a zip file, and upload the zip to a public repo (a minimal zipping sketch is below).
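Creating the zip itself is simple. Here is a minimal sketch in Python; the folder name training_images/ is my own placeholder, not anything Replicate requires:

import zipfile
from pathlib import Path

# Pack every image from a local folder (placeholder name) flat into data.zip.
with zipfile.ZipFile("data.zip", "w") as zf:
    for img in sorted(Path("training_images").iterdir()):
        if img.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            zf.write(img, arcname=img.name)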
The first bug showed up at this step: Replicate kept telling me that the zip file I had created was not a valid zip. Or the issue may have been with hosting the zip on GitHub and how Replicate retrieved it.
Hence I had to upload the zip using the following API suggested in the blog (I did not want to use it because I was unsure where those images end up being hosted, but I had no other option):
# Request a pre-signed upload URL from Replicate's experimental Dreambooth API ($INPUT_TOKEN is your API token)
RESPONSE=$(curl -s -X POST -H "Authorization: Token $INPUT_TOKEN" https://dreambooth-api-experimental.replicate.com/v1/upload/data.zip)
# PUT the local zip to the returned upload_url, then extract the serving_url (the hosted link to your zip)
curl -X PUT -H "Content-Type: application/zip" --upload-file INPUT_FILE.zip "$(grep -o "\"upload_url\": \"[^\"]*" <<< $RESPONSE | awk -F'"' '{print $4}')"
SERVING_URL=$(grep -o "\"serving_url\": \"[^\"]*" <<< $RESPONSE | awk -F'"' '{print $4}')
echo $SERVING_URL
We receive an output link like https://replicate.delivery/pbxt/…yadayadayada…/data.zip; keep this serving URL handy for the training step.
Create a model in Replicate.
Set the type to SDXL and choose an Nvidia A40 (Large) GPU.
Run the pre-written code in Google Colab (a sketch of the underlying call is below). The model gets fine-tuned in about 10 minutes.
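Under the hood, that Colab code boils down to one training call against Replicate's SDXL trainer. A minimal sketch via the Python client, where the trainer version hash, the destination model name, and the truncated zip URL are placeholders to fill in with your own values:

import replicate

# Kick off a fine-tuning job on the SDXL trainer.
training = replicate.trainings.create(
    version="stability-ai/sdxl:<version-id>",  # placeholder: the SDXL trainer version
    input={
        "input_images": "https://replicate.delivery/pbxt/.../data.zip",  # the serving URL from the upload step
    },
    destination="<your-username>/sypha-sdxl",  # the model you created on Replicate earlier
)
print(training.status)  # poll this (or watch the web UI) until training finishes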
Start prompting via the API (sketch below) or on Replicate's UI to get images.
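Prompting the fine-tuned model from code looks almost the same as the baseline test, just pointed at the new model. Another sketch with placeholders; the subject is referenced in the prompt by the trainer's trigger token (the default is "TOK" as far as I can tell, so double-check your training settings):

import replicate

# Generate images from the fine-tuned model; the trigger token stands in for the trained subject.
output = replicate.run(
    "<your-username>/sypha-sdxl:<version-id>",  # placeholder: your fine-tuned model and its version
    input={"prompt": "A photo of TOK dancing in front of Dracula's Castle, sunny day"},
)
print(output)  # a list of generated image URLs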
Costing
An Nvidia A40 (Large) costs about $0.043 per minute of compute. Fine-tuning on the 6 images I had zipped took around 10 minutes, and running at least 7-8 prompts afterwards added a few more minutes of GPU time. The total still comes in below 100 rupees, and the output is just mind-blowing. A rough breakdown:
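A quick back-of-the-envelope check of that claim, with the rupee conversion rate being my own assumption rather than anything Replicate charges in:

# Rough cost estimate for the whole experiment.
price_per_min_usd = 0.043   # Nvidia A40 (Large) on Replicate
training_minutes = 10       # fine-tuning on the 6 zipped images
inference_minutes = 5       # ballpark for 7-8 prompts afterwards
usd_to_inr = 83             # assumed exchange rate, not from Replicate

total_usd = price_per_min_usd * (training_minutes + inference_minutes)
print(f"~${total_usd:.2f}, roughly ₹{total_usd * usd_to_inr:.0f}")  # about $0.65, i.e. around ₹54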
Output





Now, go watch the Castlevania trailer and compare this output with the original. What do you folks think?
Implications
This is mind-blowing because so many applications can potentially get disrupted by this evolving tech. Imagine humans making a character once, and then AI animating or replicating it in any number of situations. (We could also test how accurately this might work with live action.)
An AI can potentially start making movies on its own (there's already an open-source video model on HuggingFace), or a single individual, with the help of such AI models, could create an entire film. Storytelling will get democratised beyond our imagination.
But if this is the scenario, will we as humans stay creative enough in the upcoming centuries? Or will we be outsourcing creativity entirely?