Figure 1. Mainstream diffusion models compared with the proposed method for generating architectural designs. Stable Diffusion
[14] fails to generate an architectural design with a specific style, and the image is not aesthetically pleasing (left panel). The architectural design styles generated by Midjourney
[15] (second from the left) and DALL·E 2
[11] (third from the left) are incorrect. None of these generated images meets the design requirements. The proposed method (far right) generates an architectural design in the correct style (prompt: “An architectural photo in the Shu Wang style, photo, realistic, high definition”).
2. Architectural Design
Architectural design relies on the professional skills and concepts of designers. Outstanding architectural designs play a crucial role in showcasing the image of a city
[2][3][4]. Moreover, iconic landmark architecture in a city stimulates local employment and boosts the tourism industry
[1][16].
Designers typically communicate architectural design proposals to clients through visual renderings. However, this conventional workflow is inefficient and often yields low-quality results. The inefficiency stems from the complexity of the conventional design process, which involves extensive manual drawing tasks
[2][5], such as creating 2D drawings, building 3D models, applying material textures, and rendering visual effects
[17]. This linear process keeps clients out of decision-making until the final rendered images are produced. If the final images do not meet the clients' expectations, designers must redo the entire design, leading to repeated modifications
[2][3][4]. Consequently, the efficiency of this design practice needs improvement
[17].
The low quality of architectural design arises because excellent designers are difficult to train and design capabilities take a long time to develop. A lack of design skill makes it difficult to improve design quality
[4][5][6]. Improving design capabilities is a gradual process: designers must constantly learn new design methods and explore different design styles
[2][3][6][18][19][20]. At the same time, finding the best design solution under complex conditions is itself a major challenge for designers
[2][5].
All these factors ultimately lead to inefficient and low-quality architectural designs
[2][5]. Therefore, new technologies must be introduced into the construction industry in a timely manner to solve these problems.
3. Diffusion Model
In recent years, the diffusion model has rapidly developed into a mainstream image generation model
[21][22][23][24][25][26], which allows designers to quickly obtain images, thereby significantly improving the efficiency and quality of architectural design
[13][27].
The traditional diffusion model comprises a forward process and a backward (reverse) process. During the forward process, noise is progressively added to the input image, transforming it into a noisy image. The backward process aims to restore the original image from the noisy image
[28]. By learning the image denoising process, the diffusion model acquires the ability to generate images
[27][29]. To generate images with specific design elements, designers can condition the denoising process of the diffusion model on text prompts, so that the generated images are consistent with the text and the results become controllable
[30][31][32][33][34][35]. The advantage of text-guided diffusion models is that they make image generation simple to control
[12][14][36][37][38].
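The forward and backward processes described above can be illustrated with a minimal Python sketch. It assumes the widely used DDPM-style formulation with a linear noise schedule, which the text does not specify; the function names, the schedule endpoints, and the toy 4-value "image" are hypothetical and chosen only for illustration. A trained network would estimate the noise (optionally conditioned on a text embedding); here the true noise is passed in, so the recovery is exact.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule, hypothetical endpoints)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Forward process: sample the noisy x_t directly from the clean x_0."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

def recover_x0(xt, noise_estimate, t, alpha_bars):
    """Backward direction: invert the forward step given a noise estimate.

    In a real model the estimate comes from the trained denoising network.
    """
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, noise_estimate)]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.5, -0.25, 1.0, 0.0]                      # toy stand-in for an image
xt, noise = forward_noise(x0, 500, alpha_bars, rng)
x0_hat = recover_x0(xt, noise, 500, alpha_bars)  # uses the true noise
```

With the true noise supplied, `x0_hat` matches `x0` up to floating-point error; in practice the quality of the restored image depends entirely on how well the network predicts the noise.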
Although diffusion models perform well in most fields, there is still room for improvement in their application in architectural design
[30][39]. Specifically, these models are trained on large amounts of Internet data that lack high-quality annotations using professional architectural terminology. As a result, the model fails to establish connections between architectural design and architectural language during learning, which makes it difficult to use professional design vocabulary to guide architectural design generation
[40][41][42][43]. Therefore, it is necessary to collect high-quality architectural design images, annotate them with relevant information, and then fine-tune the model to adapt it to the architectural design task.
4. Model Fine-Tuning
Diffusion models learn new knowledge and concepts through complete retraining or fine-tuning for new scenarios. Due to the huge cost of retraining the entire model, the need for large image datasets, and the long training time
[11][15], model fine-tuning is currently the most feasible option.
There are four standard methods of fine-tuning. The first is textual inversion
[11][31][41][44], which freezes the text-to-image model and learns only new embedding vectors that encode the new knowledge. This method trains quickly and produces very small model files, but the image generation quality is mediocre. The second is the Hypernetwork
[42] method, which inserts a separate small neural network into the intermediate layers of the original diffusion model to influence the output. This method trains relatively quickly, but the image generation quality is average. The third is LoRA
[43], which adds trainable weights to the cross-attention layers so that the model can learn new knowledge. This method produces models with an average size of hundreds of MB after a moderate training time, and the image generation quality is good. The fourth is the Dreambooth
[40] method, which fine-tunes the entire original diffusion model. It trains the diffusion model with a prior-preservation loss, enabling it to generate images consistent with the prompts while preventing overfitting
[45][46]. New concepts should be named with rare words to avoid language drift caused by similarity with the original model's vocabulary
[45][46]. This method requires only 3 to 5 images of a specific subject with corresponding text descriptions, can be fine-tuned for specific situations, and matches specific text descriptions to the characteristics of the input images. The fine-tuned model then generates images from the subject-specific term combined with general descriptors
[26][41]. Because the Dreambooth method fine-tunes the entire model, it usually produces the best results of the four methods.
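Two of the mechanisms above can be made concrete with a short Python sketch. It is not the implementation used in this work. The first part shows the low-rank update commonly used in LoRA, where a frozen weight matrix W is augmented by the product of two small trainable matrices B and A, with B initialized to zero. The second part shows a prior-preservation loss in the spirit of Dreambooth: the denoising loss on the few subject images plus a weighted denoising loss on generic class images. All names, matrix sizes, the `scale` and `prior_weight` parameters, and the numeric inputs are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, lora_b, lora_a, scale=1.0):
    """Frozen weight W plus the trainable low-rank update scale * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def prior_preservation_loss(pred_subject, noise_subject,
                            pred_class, noise_class, prior_weight=1.0):
    """Subject denoising loss plus a weighted loss on generic class images."""
    return (mse(pred_subject, noise_subject)
            + prior_weight * mse(pred_class, noise_class))

rng = random.Random(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes; real attention layers are far larger

# Frozen pretrained weight (random stand-in) and the two trainable factors.
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start

w_eff = lora_effective_weight(w, lora_b, lora_a)
full_params = d_out * d_in            # parameters touched by full fine-tuning
lora_params = rank * (d_out + d_in)   # parameters trained by the low-rank update

# Toy noise predictions and targets for one subject image and one class image.
loss = prior_preservation_loss(
    pred_subject=[0.1, -0.2], noise_subject=[0.0, 0.0],
    pred_class=[0.3, 0.0], noise_class=[0.1, 0.0],
    prior_weight=0.5,
)
```

The sketch highlights the trade-off discussed above: the low-rank update trains `rank * (d_out + d_in)` values per layer instead of `d_out * d_in`, whereas full fine-tuning updates every weight and relies on the class-image term to keep the model from overfitting to the handful of subject images.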