AI TutorMaster


“PhotoMaker”: A Tool That Brings AI-Powered Personalization to Your Local Machine with a 16GB GPU!

Forget expensive software or online tools! With its stacked ID embedding technique, PhotoMaker lets you craft highly detailed, customized portraits on your own computer.

Introduction

Recent advances in generative models have led to remarkable progress in synthesizing realistic human photos conditioned on text descriptions. However, existing personalized generation methods cannot simultaneously satisfy high efficiency, identity fidelity, and flexible text controllability.

To address these limitations, researchers from Nankai University, Tencent PCG, and The University of Tokyo introduced PhotoMaker — an efficient personalized text-to-image generation method that encodes input ID images into a stacked ID embedding to preserve identity information.

This article provides an in-depth look at how PhotoMaker works and details key innovations like the stacked ID embedding, merging cross-attention mechanism, automated dataset construction pipeline and its applications.

The Stacked ID Embedding

The core innovation of PhotoMaker is the stacked ID embedding which serves as a unified representation encapsulating visual information from multiple input ID images. This allows comprehensive preservation of identity characteristics necessary for generating customized portraits.

The pipeline starts by encoding input images using a CLIP ViT-L/14 image encoder to obtain embeddings which are projected to match text embedding dimensions. The class embedding (man/woman) is then fused with each image embedding using MLPs. The resulting fused embeddings $\{\hat{e}^1, \dots, \hat{e}^N\}$ represent semantic identity information.

These are concatenated along the length dimension into a single stacked ID embedding $s^* \in \mathbb{R}^{N \times D}$, where N is the variable number of input images and D is the projected dimension.

Crucially, at inference the stacked embedding can be constructed from different ID images enabling identity mixing applications. The sub-parts maintain correspondence to each input image.
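The fuse-then-stack step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the fusion is modeled as "concatenate, then a two-layer MLP with ReLU", the weights are random stand-ins for trained parameters, and the names `fuse`, `stack_id_embeddings` and the toy dimension `D = 8` are hypothetical rather than taken from the PhotoMaker code.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Random weights standing in for trained parameters (illustration only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def fuse(class_emb, image_emb, mlp):
    """Fuse the class-word embedding with one projected image embedding.

    Assumed fusion: concatenate both vectors, then a two-layer MLP with ReLU.
    """
    (w1, b1), (w2, b2) = mlp
    hidden = [max(0.0, h) for h in linear(class_emb + image_emb, w1, b1)]
    return linear(hidden, w2, b2)

def stack_id_embeddings(class_emb, image_embs, mlp):
    """Return the stacked ID embedding s*: one fused row per input image (N x D)."""
    return [fuse(class_emb, e, mlp) for e in image_embs]

rng = random.Random(0)
D = 8  # toy embedding dimension (hypothetical)
mlp = (init_linear(2 * D, D, rng), init_linear(D, D, rng))
class_emb = [rng.gauss(0, 1) for _ in range(D)]  # embedding of "man"/"woman"
image_embs = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(3)]  # N = 3 ID images

stacked = stack_id_embeddings(class_emb, image_embs, mlp)
print(len(stacked), len(stacked[0]))  # prints "3 8": N rows, D columns
```

Because each row depends only on its own input image, passing images of different people to `stack_id_embeddings` simply yields more rows, which is the property the identity mixing application relies on.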


Merging Cross-Attention Mechanism

The key advantage of diffusion models is the cross-attention layers which can naturally integrate signals from text and image. PhotoMaker exploits this to merge identity information from the stacked embedding.

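First, the class word is replaced with s* in the text embedding to get an updated embedding t* containing both text and ID semantics. The cross-attention is formulated as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) \cdot V$

where the query Q is projected from the latent image representation, and the key K and value V are projected from t*. Each spatial location of the latent therefore attends over the text and ID tokens, which directs generation.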

Additional LoRA training on attention matrices helps better perceive ID characteristics. The model itself learns to integrate signals based on relevance which outperforms other composition approaches.
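The two steps in this section, replacing the class word with s* and then running cross-attention, can be traced on toy data. The sketch below is not the PhotoMaker implementation: it assumes identity projections (the latent is used directly as Q, and t* as both K and V), and `replace_class_token`, `cross_attention` and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def replace_class_token(text_emb, class_index, stacked_id_emb):
    """Build t*: swap the class-word row for the N rows of s*."""
    return text_emb[:class_index] + stacked_id_emb + text_emb[class_index + 1:]

def cross_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d)) V one query row at a time."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

# Toy data: 4 text tokens and 2 ID images, all of dimension 2 (hypothetical values).
text_emb = [[0.1, 0.0], [0.0, 0.2], [0.5, 0.5], [0.3, 0.1]]  # row 2 is the class word
stacked_id_emb = [[1.0, 0.0], [0.0, 1.0]]                     # s* with N = 2
t_star = replace_class_token(text_emb, class_index=2, stacked_id_emb=stacked_id_emb)

# Identity projections are assumed: Q is the latent itself, K = V = t*.
latent = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # 3 spatial positions
attended = cross_attention(latent, t_star, t_star)
print(len(t_star), len(attended), len(attended[0]))  # prints "5 3 2"
```

Note that the sequence length grows from 4 to 5 tokens: one class-word row is removed and N = 2 ID rows take its place, so each latent position can weight the individual ID images differently.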

Automated Dataset Construction

Generating the stacked ID embedding necessitates simultaneous multi-image input of consistent identities during training. However, existing human datasets do not provide fine-grained identity annotation.

Encoders

  • Uses CLIP ViT-L/14 model to encode input ID images into embeddings
  • Image encoder layers are fine-tuned to focus on human identity features
  • Projection layers align dimensions to text embedding space

Stacking

  • Text embedding class word (man/woman) fused with each image embedding
  • Fused embeddings concatenated along length dimension into single stacked tensor
  • Allows variable number of input ID images to compose rich identity representation

Merging

  • Cross-attention in diffusion model exploits stacked embedding for generation
  • Text embedding class word replaced with stacked ID embedding
  • Attention learns to selectively integrate identity features from input images

Automated Dataset Pipeline

Construct training data with fine-grained identity labels:

Image Downloading

  • Web crawl celebrity names from face datasets to retrieve images

Face Detection and Filtering

  • Use RetinaNet model to detect face regions
  • Filter low quality or irrelevant non-face images

Identity Verification

  • Employ ArcFace model to extract embeddings
  • Verify consistent ID between images based on similarity

Cropping and Segmentation

  • Crop the main subject region and segment the person using Mask2Former

Captioning and Marking

  • Generate captions with BLIP model
  • Mark corresponding class words for identities
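The identity verification step of the pipeline above can be illustrated with a small similarity filter. This is a sketch under assumptions, not the authors' procedure: face embeddings are taken as given (in practice they would come from a model such as ArcFace), agreement is measured by cosine similarity to the group's mean embedding, and the threshold of 0.5 and the function name `filter_consistent_identity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_consistent_identity(embeddings, threshold=0.5):
    """Return indices of embeddings that agree with the group's mean embedding."""
    dim = len(embeddings[0])
    center = [sum(e[j] for e in embeddings) / len(embeddings) for j in range(dim)]
    return [i for i, e in enumerate(embeddings) if cosine_similarity(e, center) >= threshold]

# Toy 2-D "face embeddings" for one crawled name; the last one is an outlier.
face_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
kept = filter_consistent_identity(face_embeddings)
print(kept)  # prints [0, 1, 2]
```

Images that survive the filter would then be treated as the same identity and passed on to cropping, segmentation and captioning.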

Applications

PhotoMaker empowers intuitive control over identity in generated images leading to creative applications:

Attribute Editing: Properties like hair color, age and accessories can be specified in the text prompt. The model modifies input characteristics accordingly while preserving recognition.

Artwork to Reality: Paintings or vintage photos of people can be transformed to realistic modern day counterparts in various contexts.

Identity Mixing: Pooling multiple IDs, including fictional characters, merges their visual features into a hybrid human.

Stylization: Artistic renditions like Ukiyo-e portraits can be produced by providing a style example while maintaining identity.


The flexibility stems from the stacked embedding’s ability to fuse aspects from separate inputs. This opens up more possibilities than single-image conditioning methods.

Detailed Model Architecture

Now we take a brief look at the complete architecture and training process of PhotoMaker:

Note: I am only giving the idea here in short bullet points; for details, please check the research paper.

Base Model

  • Stable Diffusion XL (SDXL) is used as the backbone generator
  • Chosen for state-of-the-art text-to-image capabilities
  • Strong results for high-resolution photorealistic human images
  • UNet architecture with self-attention and cross-attention layers
  • Cross-attention is conditioned on 2048-dimensional text embeddings
  • Pre-trained on large-scale image-text pairs

Encoders

  • Image Encoder: CLIP ViT-L/14
  • 24 Transformer layers forming visual backbone
  • Patch size of 14x14 pixels, hidden size d=1024
  • Pre-trained on 400M image-text pairs
  • Fine-tune top 4 layers during PhotoMaker training
  • Projection head added to match text embedding dimensions
  • Text Encoders: the two pre-trained text encoders of SDXL
  • CLIP ViT-L/14 text encoder: 12 Transformer layers with d=768
  • OpenCLIP ViT-bigG/14 text encoder: 32 Transformer layers with d=1280
  • Both are used as pre-trained; their outputs are concatenated into 2048-dimensional token embeddings

Losses

  • Adversarial Loss: discriminator updates on real vs. generated images; optimizes general image quality and coherence
  • Masked Diffusion Loss: compares early noised latents in masked identity regions; ensures faithfulness of personalization
  • Classifier-Free Guidance: adds lightweight decoder guidance; improves text-image semantic consistency
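To make the idea of a masked diffusion loss concrete, here is a minimal sketch, assuming the loss is a mean squared error between predicted and true noise that is averaged only over positions inside the identity mask. The function name `masked_mse` and the flattened toy values are hypothetical; the real training code operates on latent tensors rather than Python lists.

```python
def masked_mse(pred_noise, true_noise, mask):
    """Mean squared error over positions where mask is 1 (inputs are flat lists)."""
    weighted = sum(m * (p - t) ** 2 for p, t, m in zip(pred_noise, true_noise, mask))
    count = sum(mask)
    return weighted / count if count else 0.0

pred_noise = [0.5, 1.0, 2.0, 0.0]
true_noise = [0.0, 1.0, 0.0, 0.0]
identity_mask = [1, 1, 0, 0]  # 1 = inside the identity (face) region

print(masked_mse(pred_noise, true_noise, identity_mask))  # prints 0.125
print(masked_mse(pred_noise, true_noise, [1, 1, 1, 1]))   # prints 1.0625
```

The large error at the third position lies outside the mask and is ignored in the first call, so the gradient signal concentrates on the identity region instead of the background.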

Optimization Scheme

  • PhotoMaker components trained from scratch
  • Stable Diffusion backbone frozen for efficiency
  • Adam optimizer with learning rates of 1e-4 for the LoRA weights and 1e-5 for the other trainable modules
  • Batch size of 48, trained for 8 days on 8 A100 40GB GPUs
  • FP16 mixed precision for faster throughput

The composed model strikes a balance between leveraging state-of-the-art pretrained networks and introducing customizations

The resulting model strikes an optimal balance between quality, editability and inference speed by exploiting state-of-the-art methods as building blocks within an innovative framework tailored for human image generation.

At inference, only a forward pass is required to encode input images. This allows realistic identity-consistent editing in around 10 seconds on a V100 GPU rather than hours of fine-tuning.

Evaluation

PhotoMaker was rigorously evaluated on held-out test identities and prompts against baselines such as DreamBooth and FastComposer:

  • Achieves the best identity preservation, +11 DINO points over FastComposer
  • High prompt relevance, only 2 CLIP-T points below DreamBooth
  • Generates more diverse facial expressions than all methods
  • Matches overall image quality with state-of-the-art FID
  • Roughly 100x faster inference than DreamBooth

The results validate its effectiveness at balancing fidelity, flexibility and speed. Ablations also confirm the benefits of the key proposed components.

🔧 Dependencies and Installation

  • Python >= 3.8 (Anaconda or Miniconda is recommended)
  • PyTorch >= 2.0.0

pip install -r requirements.txt

⏬ Download Models

The model is downloaded automatically by the following two lines:

from huggingface_hub import hf_hub_download
photomaker_path = hf_hub_download(repo_id="TencentARC/PhotoMaker", filename="photomaker-v1.bin", repo_type="model")

You can also download the model manually from https://huggingface.co/TencentARC/PhotoMaker.

💻 How to Test

Realistic generation

import os

import torch
from PIL import Image  # needed by image_grid below
from diffusers.utils import load_image
from diffusers import EulerDiscreteScheduler
from photomaker.pipeline import PhotoMakerStableDiffusionXLPipeline

## I downloaded the model locally in my Colab notebook
## Global variables and functions
def image_grid(imgs, rows, cols, size_after_resize):
    assert len(imgs) == rows*cols

    w, h = size_after_resize, size_after_resize
    
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        img = img.resize((w,h))
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

base_model_path = 'SG161222/RealVisXL_V3.0'
photomaker_path = 'release_model/photomaker-v1.bin'
device = "cuda"
save_path = "./outputs"

# Load base model
pipe = PhotoMakerStableDiffusionXLPipeline.from_pretrained(
    base_model_path, 
    torch_dtype=torch.bfloat16, 
    use_safetensors=True, 
    variant="fp16",
#     local_files_only=True,
).to(device)

# Load PhotoMaker checkpoint
pipe.load_photomaker_adapter(
    os.path.dirname(photomaker_path),
    subfolder="",
    weight_name=os.path.basename(photomaker_path),
    trigger_word="img"
)     

pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()
# define and show the input ID images
input_folder_name = './examples/newton_man'
image_basename_list = os.listdir(input_folder_name)
image_path_list = sorted([os.path.join(input_folder_name, basename) for basename in image_basename_list])

input_id_images = []
for image_path in image_path_list:
    input_id_images.append(load_image(image_path))
    
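# One row, one column per input image, so the assertion in image_grid holds for any folder size
input_grid = image_grid(input_id_images, 1, len(input_id_images), size_after_resize=224)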
print("Input ID images:")
input_grid
# Note that the trigger word `img` must follow the class word for personalization
prompt = "sci-fi, closeup portrait photo of a man img wearing the sunglasses in Iron man suit, face, slim body, high quality, film grain"
negative_prompt = "(asymmetry, worst quality, low quality, illustration, 3d, 2d, painting, cartoons, sketch), open mouth"
generator = torch.Generator(device=device).manual_seed(42)

## Parameter setting
num_steps = 50
style_strength_ratio = 20
start_merge_step = int(float(style_strength_ratio) / 100 * num_steps)
if start_merge_step > 30:
    start_merge_step = 30
    
images = pipe(
    prompt=prompt,
    input_id_images=input_id_images,
    negative_prompt=negative_prompt,
    num_images_per_prompt=4,
    num_inference_steps=num_steps,
    start_merge_step=start_merge_step,
    generator=generator,
).images
# Show and save the results
## Downsample for visualization
grid = image_grid(images, 1, 4, size_after_resize=512)

os.makedirs(save_path, exist_ok=True)
for idx, image in enumerate(images):
    image.save(os.path.join(save_path, f"photomaker_{idx:02d}.png"))
    
print("Results:")
grid

For more examples, see https://github.com/TencentARC/PhotoMaker/tree/main
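The "Parameter setting" lines in the code above turn a style strength percentage into the denoising step at which the ID embedding starts to be merged. As far as I understand the pipeline, earlier steps use the text-only prompt, so a larger ratio leaves more room for the prompt and style and less for the identity. The helper below only repackages that arithmetic; the function name `compute_start_merge_step` and the `max_step` parameter are my own, with the cap of 30 taken from the code above.

```python
def compute_start_merge_step(style_strength_ratio, num_steps, max_step=30):
    """Convert a style strength in percent into the step where ID merging starts."""
    start = int(float(style_strength_ratio) / 100 * num_steps)
    return min(start, max_step)  # same cap as the `if start_merge_step > 30` check

for ratio in (20, 50, 80):
    print(ratio, compute_start_merge_step(ratio, num_steps=50))
# prints:
# 20 10
# 50 25
# 80 30
```

With the article's settings (`style_strength_ratio = 20`, `num_steps = 50`), merging starts at step 10; a ratio of 80 would give 40 and is capped at 30.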

Stylization generation

Note: only change the base model and add the LoRA modules for better stylization

import torch
import numpy as np
import random
import os
from PIL import Image

from diffusers.utils import load_image
from diffusers import DDIMScheduler
from huggingface_hub import hf_hub_download

from photomaker.pipeline import PhotoMakerStableDiffusionXLPipeline
# Global variables and functions
def image_grid(imgs, rows, cols, size_after_resize):
    assert len(imgs) == rows*cols

    w, h = size_after_resize, size_after_resize
    
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        img = img.resize((w,h))
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

base_model_path = './civitai_models/sdxlUnstableDiffusers_v11.safetensors'
photomaker_path = './release_model/photomaker-v1.bin'
lora_path = './civitai_models/xl_more_art-full.safetensors'
# Path to the LoRA weights; the rest of the code is the same as in the realistic example

device = "cuda"
save_path = "./outputs"
# Load base model
pipe = PhotoMakerStableDiffusionXLPipeline.from_single_file(
    base_model_path, 
    torch_dtype=torch.bfloat16, 
    original_config_file=None,
).to(device)

# Load PhotoMaker checkpoint
pipe.load_photomaker_adapter(
    os.path.dirname(photomaker_path),
    subfolder="",
    weight_name=os.path.basename(photomaker_path),
    trigger_word="img"
)     

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
print("Loading lora...")
pipe.load_lora_weights(os.path.dirname(lora_path), weight_name=os.path.basename(lora_path), adapter_name="xl_more_art-full")
pipe.set_adapters(["photomaker", "xl_more_art-full"], adapter_weights=[1.0, 0.5])
pipe.fuse_lora()

# define and show the input ID images
image_path = './examples/scarletthead_woman/scarlett_0.jpg'

input_id_images = []
input_id_images.append(load_image(image_path))
    
input_grid = image_grid(input_id_images, 1, 1, size_after_resize=224)
print("Input ID images:")
input_grid
## Note that the trigger word `img` must follow the class word for personalization
prompt = "A girl img riding dragon over a whimsical castle, 3d CGI, art by Pixar, half-body, screenshot from animation"
negative_prompt = "realistic, photo-realistic, bad quality, bad anatomy, worst quality, low quality, lowres, extra fingers, blur, blurry, ugly, wrong proportions, watermark, image artifacts, bad eyes, bad hands, bad arms"
generator = torch.Generator(device=device).manual_seed(42)

## Parameter setting
num_steps = 50
style_strength_ratio = 20
start_merge_step = int(float(style_strength_ratio) / 100 * num_steps)
if start_merge_step > 30:
    start_merge_step = 30
    
images = pipe(
    prompt=prompt,
    input_id_images=input_id_images,
    negative_prompt=negative_prompt,
    num_images_per_prompt=4,
    num_inference_steps=num_steps,
    start_merge_step=start_merge_step,
    generator=generator,
).images
# Show and save the results
## Downsample for visualization
grid = image_grid(images, 1, 4, size_after_resize=512)

os.makedirs(save_path, exist_ok=True)
for idx, image in enumerate(images):
    image.save(os.path.join(save_path, f"photomaker_style_{idx:02d}.png"))
    
print("Results:")
grid
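
Both prompts above follow the rule stated in the code comments: the trigger word `img` must come right after a class word. A small pre-check can catch malformed prompts before a long generation run. This is a hypothetical helper, not part of the PhotoMaker API, and the list of accepted class words is an assumption chosen for illustration.

```python
def check_trigger_word(prompt, trigger_word="img",
                       class_words=("man", "woman", "boy", "girl", "person")):
    """Return True if the trigger word appears exactly once, right after a class word."""
    tokens = prompt.replace(",", " ").split()
    positions = [i for i, token in enumerate(tokens) if token == trigger_word]
    if len(positions) != 1:
        return False  # missing trigger word, or more than one
    index = positions[0]
    return index > 0 and tokens[index - 1].lower() in class_words

print(check_trigger_word("A girl img riding dragon over a whimsical castle"))  # prints True
print(check_trigger_word("img of a man in an Iron man suit"))                  # prints False
print(check_trigger_word("closeup portrait photo of a man"))                   # prints False
```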

Resources Required:

As shown above, I was able to run it on my local PC with an RTX 4090.

Conclusion

PhotoMaker introduces an elegant technique to distill visual identity cues through the stacked embedding for diffusion model text conditioning. This uniqueness enables practical applications within an efficiently trainable framework.

Ongoing innovations in conditioned image synthesis hold exciting potential. Embedding identity semantics is a promising direction to pursue, as PhotoMaker has demonstrated by producing creative, recognizable and customizable human portraits through intuitive text prompts.
