While generating assets, I encountered some crashes and failures. In this post, I will share how I
debugged and solved some of these issues.
First, I wrote a Python script that renders several views of the object to .png files, so I could inspect each view separately. Here is the code snippet:
import torch
import diffusers
from diffusers import DiffusionPipeline
from PIL import Image
import sys

# Add the missing class to diffusers
sys.path.append("threestudio/extern")
from zero123 import Zero123Pipeline
diffusers.Zero123Pipeline = Zero123Pipeline

# Prepare the pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "ashawkey/stable-zero123-diffusers",
    torch_dtype=torch.float16, trust_remote_code=True)
pipeline.to('cuda:0')

# Prepare the input data: one camera pose per requested view
image = Image.open("image.png").convert("RGB").resize((256, 256))
elevations = []
azimuths = []
camera_distances = []
images = []
for n in range(4):
    elevations.append(torch.tensor([30.0], dtype=torch.float16).to('cuda:0'))
    azimuths.append(torch.tensor([45.0 * n], dtype=torch.float16).to('cuda:0'))
    camera_distances.append(torch.tensor([1.2], dtype=torch.float16).to('cuda:0'))
    images.append(image)

# Generate and save the images. torch.cat keeps the tensors on the GPU
# in float16; re-wrapping the list with torch.tensor() would instead
# produce a float32 CPU copy.
images = pipeline(images,
                  torch.cat(elevations),
                  torch.cat(azimuths),
                  torch.cat(camera_distances),
                  num_inference_steps=50).images
for num, azimuth in enumerate(azimuths):
    images[num].save(f"{float(azimuth)}.png")
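To inspect the generated views side by side, it helps to tile them into a single contact sheet. A minimal sketch with PIL; the file names and the 256-pixel size match the script above, while contact_sheet.png is just a name I picked:

from PIL import Image

# Tile the four generated views (0.0.png, 45.0.png, ...) into one row.
size = 256
angles = [45.0 * n for n in range(4)]
sheet = Image.new("RGB", (size * len(angles), size), "white")
for i, angle in enumerate(angles):
    view = Image.open(f"{angle}.png").resize((size, size))
    sheet.paste(view, (i * size, 0))
sheet.save("contact_sheet.png")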
To illustrate how the network works, I will use a simple example that produced a good result. This image was successfully converted into a 3D object, as shown below:

However, not all images worked as well. For instance, this image caused some problems:

The cat looks fine, but the ground and the smoke on the left confused the network. I decided to mask out these parts and try again:

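For reference, the masking itself can be done with PIL. A minimal sketch, assuming a hand-drawn grayscale mask saved as mask.png (white = keep, black = remove); the file names are placeholders:

from PIL import Image

# Composite the subject onto a plain white background: white areas of
# the mask keep the original pixels, black areas are blanked out.
image = Image.open("cat.png").convert("RGB")
mask = Image.open("mask.png").convert("L").resize(image.size)
background = Image.new("RGB", image.size, "white")
Image.composite(image, background, mask).save("cat_masked.png")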
The cat is still a bit flat, but at least the generation did not crash. The result is acceptable:

Next, I tried another, more challenging image:

The network failed to generate a 3D object from this image. After inspecting the results, I concluded that the object was too large and extended beyond the frame. I resized the image to make the object smaller:

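The downscale-and-pad step itself is simple: shrink the object and paste it centered on a canvas of the original size, so empty space appears around it. A minimal sketch, with an assumed scale factor of 0.7 and placeholder file names:

from PIL import Image

# Shrink the image and re-center it on a white canvas of the original
# size, leaving an empty margin around the subject.
scale = 0.7  # assumed factor; tune per image
image = Image.open("object.png").convert("RGB")
w, h = image.size
small = image.resize((int(w * scale), int(h * scale)))
canvas = Image.new("RGB", (w, h), "white")
canvas.paste(small, ((w - small.width) // 2, (h - small.height) // 2))
canvas.save("object_padded.png")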
Unfortunately, this did not solve all the issues. It fixed the crash, but the shape is concave and very complex, and the network was unable to predict it. It produced something, but not a usable result:
I can now guess that the lack of empty space around the object caused most of my crashes. But that does not explain this one:

The tree-like structure switches position from frame to frame, and the roots break the table's legs. This image is too complex for the network to handle, so I removed the tree and kept only the table:

This time, the network was able to generate a 3D object, but with artifacts: it added extra objects in every frame that are not present in the original image. Still, the result is not perfect, but not terrible either:
Conclusion
From these experiments, I learned that some of the crashes and failures can be fixed by inspecting and modifying the source images. A good practice is to leave some empty space around the main subject in the image and avoid cluttered or noisy backgrounds.
I’ll try building some tools to make this process easier.
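One candidate is an automated check for subjects that touch the frame border, since that predicted most of my crashes. A rough sketch, assuming a near-white background and NumPy; the 0.95 brightness threshold and 4-pixel margin are guesses:

import numpy as np
from PIL import Image

def touches_border(path, threshold=0.95, margin=4):
    # Heuristic: flag images where darker-than-background pixels
    # appear within `margin` pixels of the frame edge.
    gray = np.asarray(Image.open(path).convert("L")) / 255.0
    subject = gray < threshold
    border = np.zeros_like(subject)
    border[:margin, :] = True
    border[-margin:, :] = True
    border[:, :margin] = True
    border[:, -margin:] = True
    return bool((subject & border).any())

print(touches_border("image.png"))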