Artificial Intelligence-Generated Master-Level Architectural Design

Artificial Intelligence-Generated Master-Level Architectural Design: Comparison

Please note this is a comparison between Version 1 by Junming Chen and Version 5 by Peter Tang.

The outstanding buildings designed by master architects are the common wealth of mankind. They reflect design skills and concepts that ordinary architectural designers do not possess. Compared with traditional methods, which rely on a great deal of mental labor for innovative design and drawing, artificial intelligence (AI) methods have greatly improved the creativity and efficiency of the design process. The method discussed here overcomes the difficulty that traditional diffusion models have in specifying a style when generating high-quality designs.

architectural design
text to design
design process optimization
design quality
design style
diffusion model

1. Introduction

1.1. Background and Motivation

Often the icon of a city, excellent architecture can attract tourists and promote local economic development [1]. However, designing outstanding architecture with conventional methods poses multiple challenges. First, conventional design involves a significant amount of manual drawing and design modification [2,3,4], which results in low design efficiency [4]. Second, cultivating designers with superb skills and ideas usually proves difficult [4,5], which leads to low-quality and inefficient architectural designs [5,6]. Such issues in the construction industry warrant urgent solutions.

1.2. Problem Statement and Objectives

Artificial intelligence (AI) has been widely used in daily life [7,8,9,10]. In particular, diffusion models can help address the low efficiency and quality of architectural design. Built on machine learning, diffusion models are trained on vast amounts of data [11,12] to generate diverse designs from text prompts [13]. Nevertheless, the current mainstream diffusion models, such as Stable Diffusion [14], Midjourney [15], and DALL·E 2 [11], have limited applications in architectural design because they cannot embed a specific design style and form in the generated architectural designs (Figure 1).

Figure 1. Mainstream diffusion models compared with the proposed method for generating architectural designs. Stable Diffusion [14] fails to generate an architectural design with a specific style, and the image is not aesthetically pleasing (left panel). The architectural design styles generated by Midjourney [15] (second from the left) and DALL·E 2 [11] (third from the left) are incorrect. None of these generated images meets the design requirements. The proposed method (far right) generates an architectural design in the correct style. (Prompt: “An architectural photo in the Shu Wang style, photo, realistic, high definition”).

2. Architectural Design

Architectural design relies on the professional skills and concepts of designers. Outstanding architectural designs play a crucial role in showcasing the image of a city [2,3,4]. Moreover, iconic landmark architecture in a city stimulates local employment and boosts the tourism industry [1,16].

Designers typically communicate architectural design proposals to clients through visual renderings. However, this conventional method suffers from low efficiency and low quality. The inefficiency stems from the complexity of the conventional design process, which involves extensive manual drawing tasks [2,5], such as creating 2D drawings, building 3D models, applying material textures, and rendering visual effects [17]. This linear design process keeps clients out of the decision-making until the final rendered images are produced. If clients find that the design does not meet their expectations when they view the final images, designers must redo the entire design, which leads to repetitive modifications [2,3,4]. Consequently, the efficiency of this design practice needs improvement [17].

The low quality of architectural design arises because excellent designers are difficult to train and design capabilities take a long time to improve. Designers' lack of design skills makes it difficult to improve design quality [4,5,6]. However, improving design capabilities is a gradual process, and designers must constantly learn new design methods and explore different design styles [2,3,6,18,19,20]. At the same time, seeking the best design solution under complex conditions also poses huge challenges for designers [2,5].

All these factors ultimately lead to inefficient and low-quality architectural designs [2,5]. Therefore, new technologies must be introduced into the construction industry in a timely manner to solve these problems.

3. Diffusion Model

In recent years, the diffusion model has rapidly developed into a mainstream image generation model [21,22,23,24,25,26], which allows designers to quickly obtain images, thereby significantly improving the efficiency and quality of architectural design [13,27].

The traditional diffusion model includes a forward process and a backward process. During the forward process, noise is continuously added to the input image, transforming it into a noisy image. The purpose of the backward process is to restore the original image from the noisy image [28]. By learning the image denoising process, the diffusion model acquires the ability to generate images [27,29]. When images with specific design elements are needed, designers can incorporate text prompts into the denoising process of the diffusion model to generate consistent images and achieve controllability over the generated results [30,31,32,33,34,35]. The advantage of text-guided diffusion models is that they allow simple control of image generation [12,14,36,37,38].
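The forward (noising) process described above can be illustrated with a minimal sketch. It assumes a DDPM-style formulation with a linear noise schedule, which the text does not specify, and treats an image as a flat list of pixel values in [-1, 1]. The function names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`) and the schedule endpoints are illustrative choices, not taken from the cited works.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) over steps 0..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, rng):
    """Jump directly to step t of the forward process.

    Returns the noisy pixels and the Gaussian noise that was mixed in.
    In this formulation, the noise is what a denoising network would be
    trained to predict during the backward process.
    """
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * p + noise_scale * n for p, n in zip(pixels, noise)]
    return noisy, noise


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    image = [0.8, -0.2, 0.5, 0.1]  # toy "image" with four pixels
    for t in (0, 499, 999):
        noisy, _ = add_noise(image, t, alpha_bars, rng)
        print(t, alpha_bars[t], noisy)
```

As `t` grows, `alpha_bars[t]` shrinks toward zero, so the original pixels contribute less and the result approaches pure Gaussian noise. The backward process learns to reverse this corruption step by step.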

Although diffusion models perform well in most fields, there is still room for improvement in their application to architectural design [30,39]. Specifically, the limitation comes from training on large amounts of Internet data that lack high-quality annotations in professional architectural terminology. As a result, the model fails to establish connections between architectural design and architectural language during learning, which makes it challenging to use professional design vocabulary to guide architectural design generation [40,41,42,43]. Therefore, it is necessary to collect high-quality architectural design images, annotate them with relevant information, and then fine-tune the model to adapt it to the architectural design task.

4. Model Fine-Tuning

Diffusion models learn new knowledge and concepts for new scenarios through either complete retraining or fine-tuning. Because retraining the entire model is hugely expensive, requires large image datasets, and takes a long time [11,15], fine-tuning is currently the most feasible option.

There are four standard methods of fine-tuning. The first is textual inversion [11,31,41,44], which freezes the text-to-image model and learns only the most suitable embedding vectors to embed new knowledge. This method trains quickly and produces the smallest output models, but the image generation effect is mediocre. The second is the Hypernetwork [42] method, which inserts a separate small neural network into the middle layers of the original diffusion model to affect the output. This method is faster to train, but the image generation effect is average. The third is LoRA [43], which adds trainable weights to the cross-attention layers to allow new knowledge to be learned. After a moderate training time, this method produces models that average hundreds of MB in size, and the image generation effect is good.

The fourth is the Dreambooth [40] method, which fine-tunes the original diffusion model as a whole. In this method, a prior-preserving loss is designed to train the diffusion model, enabling it to generate images consistent with the prompts while preventing overfitting [45,46]. Using rare vocabulary to name new knowledge is recommended, to avoid language drift caused by similarity with the original model vocabulary [45,46]. The method requires only 3 to 5 images of a specific subject, together with corresponding text descriptions, to fine-tune the model for a specific situation and to match the specific text descriptions to the characteristics of the input images. Fine-tuned models generate images based on specific subject terms and general descriptors [26,41]. Since the entire model is fine-tuned with the Dreambooth method, its results are usually the best of these four methods.
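The prior-preserving loss mentioned above can be sketched in simplified form. The sketch assumes the loss is a sum of two mean-squared-error terms: one on the few subject images and one, scaled by a weight, on images the original model generates for the generic class. The function names, the `prior_weight` parameter, and the flat-list representation of predictions are illustrative assumptions and do not reproduce the cited implementation.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length lists of floats."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def prior_preservation_loss(instance_pred, instance_target,
                            prior_pred, prior_target, prior_weight=1.0):
    """Combine the subject-specific loss with a class-prior loss.

    instance_*: denoising prediction and target for the new subject images
                (for example, a few photos in one architect's style).
    prior_*:    prediction and target for images sampled from the original
                model for the generic class, which anchor its prior knowledge.
    prior_weight: hypothetical knob balancing the two terms.
    """
    instance_term = mse(instance_pred, instance_target)
    prior_term = mse(prior_pred, prior_target)
    return instance_term + prior_weight * prior_term


if __name__ == "__main__":
    # Perfect fit on the subject, but drift on the class prior is penalized.
    loss = prior_preservation_loss(
        instance_pred=[1.0, 2.0], instance_target=[1.0, 2.0],
        prior_pred=[0.0, 0.0], prior_target=[1.0, 1.0],
    )
    print(loss)
```

The second term is what discourages overfitting: a model that memorizes the handful of subject images but forgets how to render the generic class incurs a large prior term even when the instance term is zero.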