Since the generative adversarial network (GAN) portrait "Edmond de Belamy" was auctioned in 2018, AI art has entered public view. One of the latest applications of AI is the generation of images from natural language descriptions, which greatly improves the efficiency of turning an idea into a visual result. In the past, whether in traditional or digital painting, an author needed skill with tools and rich technical experience to map the mind's imagination accurately onto the visual layer. In co-creation with text-to-image AI generators, by contrast, both artists and nonartists can produce many high-quality images simply by entering a text description. In studies of traditional painting tasks, artists and nonartists showed quantitative and qualitative differences: artists spent more time planning their paintings, exercised more control over their creative processes, commanded more specific skills, and worked more efficiently than nonartists. Whether such differences persist in this new mode of human–AI interaction, and what new differences arise, is worth discussing.
1. Introduction
A series of text-to-image AI systems, such as Disco Diffusion [1], Midjourney [2], Stable Diffusion [3], OpenAI's DALL-E 2 [4], and Google's Imagen [5], is making a big splash. Their generation mechanism is to use a language–vision model to understand the "prompt" entered by users and then guide a generator to produce high-quality images. They can synthesize images of any style and content from a prompt, and users can direct the system to iterate further variations. With the rise of AI art, many artists have started to use AI to assist their creation. According to the Colorado State Fair competition's website [6], the piece "Théâtre D'opéra Spatial," generated with Midjourney, won first place in the digital art category. As generators that turn natural language into images of many creative styles take shape, an immediate question arises: what is the essence of artistic creation, and what is the core capability of artists? Though art was long thought to be one thing machines could never do, we may now face the challenges of this emerging AI technology.
2. Text-to-Image Systems
With the successful application of transformer-based architectures in natural language processing (NLP), text-to-image systems based on deep generative models have become a popular means for computer vision tasks [7][8]. They generate creative images combining concepts, attributes, and styles from expressive text descriptions [9]. The primary generation mechanism is that a language–vision model (e.g., CLIP) guides the generator to produce high-quality images.
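This guidance mechanism can be illustrated with a minimal, hypothetical sketch: fixed random projections stand in for CLIP's text and image encoders, and random search stands in for the gradient-based guidance a real generator would receive. Every name and parameter here is illustrative, not taken from any actual system.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # dimensionality of the shared text-image embedding space


def normalize(v):
    return v / np.linalg.norm(v)


# Hypothetical stand-ins for CLIP's encoders: fixed random projections
# into a shared embedding space (a real system would use the trained
# text and image towers of CLIP).
TEXT_PROJ = rng.normal(size=(DIM, DIM))
IMG_PROJ = rng.normal(size=(DIM, DIM))


def encode_text(features):
    return normalize(TEXT_PROJ @ features)


def encode_image(latent):
    return normalize(IMG_PROJ @ latent)


def clip_guided_search(prompt_emb, steps=500, step_size=0.1):
    """Nudge an image latent so its embedding matches the prompt.

    Real systems back-propagate CLIP's similarity score through the
    generator; this sketch replaces that gradient with random search.
    Returns the cosine similarity before and after guidance.
    """
    latent = rng.normal(size=DIM)
    score = float(prompt_emb @ encode_image(latent))
    start = score
    for _ in range(steps):
        candidate = latent + step_size * rng.normal(size=DIM)
        cand_score = float(prompt_emb @ encode_image(candidate))
        if cand_score > score:  # keep moves that better match the prompt
            latent, score = candidate, cand_score
    return start, score


before, after = clip_guided_search(encode_text(rng.normal(size=DIM)))
print(f"similarity before: {before:.3f}, after guidance: {after:.3f}")
```

The essential point survives the simplification: the language–vision model only scores candidates, while a separate generative process proposes them.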
When OpenAI released CLIP in 2021 [10], it spurred immense technical progress in text-to-image generation. CLIP is a pre-trained language–vision model that enables zero-shot image manipulation guided by text prompts. Unlike traditional representation learning, which is based mostly on discretized labels, a vision–language model aligns images and texts in a common feature space, allowing zero-shot transfer to downstream tasks via prompting [11]. Used as a discriminator in a generative system, CLIP guides the generator to synthesize digital images, and its joint text–image representation space lets people control the synthesis process with natural language. At present, most programs use CLIP for text encoding, such as DALL-E 2 and Stable Diffusion. By contrast, Google's Imagen uses the T5-XXL language model to encode the text and then generates images directly without learning a prior model [5]. The text input, known as the prompt, plays a crucial role in downstream tasks: it is an important means of improving the quality and changing the aesthetics of images, and using it well entails practice and skill in interacting with the system. This practice and skill of writing prompts is known as prompt engineering, owing to its iterative and experimental nature [12]. However, identifying the right prompt is a nontrivial task that often takes a significant amount of time for word tuning: a slight change in wording can make a huge difference in the result [11].
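The way prompt variants are compared against an image can be sketched with CLIP-style cosine scoring followed by a temperature-scaled softmax. The embeddings below are random placeholders, not real encoder outputs; only the scoring arithmetic is the point.

```python
import numpy as np

rng = np.random.default_rng(1)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def softmax(scores, temperature=0.07):
    # CLIP divides similarities by a learned temperature before the
    # softmax; 0.07 is the commonly cited initial value.
    z = np.asarray(scores) / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()


# Hypothetical embeddings: in a real pipeline these would come from
# CLIP's image and text encoders; random vectors stand in here.
image_emb = rng.normal(size=512)
prompts = [
    "a cat painting",
    "an oil painting of a cat, chiaroscuro lighting",
    "a watercolor cat in a pastel palette",
]
prompt_embs = [rng.normal(size=512) for _ in prompts]

sims = [cosine(e, image_emb) for e in prompt_embs]
probs = softmax(sims)
for prompt, sim, prob in zip(prompts, sims, probs):
    print(f"{sim:+.3f}  p={prob:.3f}  {prompt}")
```

Because the low temperature sharpens the distribution, small differences in raw similarity translate into large differences in the final scores, which is one concrete sense in which a slight change of wording can matter.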
Currently, text-to-image generation models follow two designs: sequence-to-sequence modeling and diffusion-based modeling [13]. The main idea of the sequence-to-sequence design is to turn images into discrete image tokens by leveraging transformer-based image tokenizers and to employ sequence-to-sequence architectures to learn the relationship between textual input and visual output from a large collection of text–image pairs; examples are the Vector Quantized Variational Autoencoder (VQ-VAE) and Vector Quantized Generative Adversarial Networks (VQ-GAN). VQ-VAE combines vector quantization with the encoder network's outputs to obtain discrete representations. By pairing these representations with an autoregressive prior, the model can generate high-quality images with a PixelCNN decoder [14]. This model was used by the first version of DALL-E [15]. As a variant, VQ-GAN represents a variety of modalities with discrete latent representations by building a codebook vocabulary from a finite set of learned embeddings and using a Transformer in place of VQ-VAE's PixelCNN [8]; in addition, a PatchGAN discriminator adds an adversarial loss during training. The representative work of this design is Parti [16]. Different from the above idea, diffusion-based models, which are built from a hierarchy of denoising autoencoders, start from random noise and gradually denoise it, conditioned on textual descriptions, until images matching the conditioning information are generated [17]. Building on the power of diffusion models in high-fidelity image synthesis, text-to-image systems have been significantly pushed forward by the recent efforts of Disco Diffusion [1], Midjourney [2], Stable Diffusion [3], DALL-E 2 [4], and Imagen [5].
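The discretization step at the heart of the sequence-to-sequence design above, the nearest-neighbor codebook lookup shared by VQ-VAE and VQ-GAN, can be sketched in a few lines. The codebook here is random rather than learned, and the sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical learned codebook: K embedding vectors of dimension D.
K, D = 8, 4
codebook = rng.normal(size=(K, D))


def quantize(z):
    """Map each encoder output vector to its nearest codebook entry.

    This is the discretization step shared by VQ-VAE and VQ-GAN: the
    returned indices are the "image tokens" that an autoregressive
    model (PixelCNN or Transformer) later learns to predict.
    """
    # Squared Euclidean distance from every vector in z to every code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


# A fake 2x2 grid of encoder outputs, flattened to 4 vectors.
z = rng.normal(size=(4, D))
tokens, z_quantized = quantize(z)
print("image tokens:", tokens.tolist())
```

Once images are reduced to such token sequences, text-to-image generation becomes a translation problem from text tokens to image tokens, which is exactly what the transformer in this family of models learns.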
At present, the diffusion-based programs Disco Diffusion, Midjourney, Stable Diffusion, and DALL-E 2 are open to the public, but Imagen is not. Disco Diffusion is a CLIP-guided diffusion model that is good at generating fairly abstract art and can currently be run in Google Colab [1]. Midjourney was created by an independent research lab of the same name. It is currently in open beta and accessible on Discord, where users type a textual prompt in the chat and the AI system generates the artwork [2]. Stable Diffusion was released by Stability AI in 2022; it uses a latent diffusion model trained on 512 × 512 images from a subset of the LAION-5B database. Similar to Google's Imagen, it uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts [18]. Furthermore, it strikes a good balance between speed and quality and can generate images within seconds [3]. The main novelty of DALL-E 2 is an extra layer of indirection: a prior network predicts an image embedding from the CLIP text embedding, and open-source implementations typically build the diffusion prior, as it is the best-performing variant [4].
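The denoising mechanism these systems share can be made concrete with a standard DDPM-style sketch: a forward process that destroys an image with scheduled Gaussian noise, and a reverse step that removes a little of it. The schedule values are the commonly used linear defaults, assumed here for illustration; conditioning on the text prompt, which every real system adds at each denoising step, is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

# DDPM-style linear noise schedule (illustrative default values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative signal-retention factor


def q_sample(x0, t, noise):
    """Forward process: jump straight to noise level t via the closed
    form x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


def p_step(x_t, t, pred_noise):
    """One reverse (denoising) step. A trained network would predict
    the noise from x_t; the true noise is passed in as a stand-in."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * pred_noise) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.normal(size=x_t.shape)


x0 = np.ones(8)  # a toy "image" of 8 pixels
noise = rng.normal(size=8)
x_T = q_sample(x0, T - 1, noise)  # almost pure noise by the last step
x_less_noisy = p_step(x_T, T - 1, noise)
print("signal retained at step T:", alpha_bar[-1])
```

Generation runs the reverse step from t = T - 1 down to 0, starting from pure noise; the text conditioning steers each predicted-noise estimate so the trajectory lands on an image matching the prompt.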
With the emergence of such open-source implementations, the use of advanced text-to-image synthesis for generating images is becoming more widespread, which represents a relevant trend in the AI art community [19].
3. Communication between Artists and Audiences
Artistic creation is a process by which artists explore and express ideas and concepts. A great painting has much more below the surface than is first seen; therefore, it must access the mind as well as the senses [20]. Similar to how humans do not really know how they breathe, artists do not truly know how they create: while they may rely on a set of fundamental principles, such as how to arrange elements, light, colors, and other components, most of their creative decisions happen intuitively [21]. The experimental results of Eindhoven and Vinacke demonstrated that, in the creative process of painting, artists have more control over their creative activities and produce better results than nonartists [22]. Kay also found that nonartists, semiprofessional artists, and professional artists differed on certain process-related variables [23].
The interplay between internal (cognitive) representation and external (physical) representation is a fascinating problem in cognitive psychology, art, science, and philosophy [24]. Various painting attributes, such as colors, shapes, and boundaries, are selectively redistributed to the brain for processing; for example, color may be experienced as warm or cold, or as cheerful or somber [25]. Audiences can also perceive the painter's actions by observing a painting's brushstrokes [26]. Moreover, from a psychological viewpoint, Kozbelt examined various experiments on artists' perception and depiction skills and presented evidence suggesting possible perceptual differences between artists and nonartists [27][28]. Aesthetic appreciation is an active process shaped by both objective, external features and subjective factors, engaging bottom-up as well as top-down processes [29]. In a series of studies on experimental aesthetics by Lyu et al. [30][31][32], the perception of artistic style was affected by individual attributes such as knowledge background and gender. Thus, the perception of art is a complex interaction between top-down and bottom-up levels, affected by various subjective and objective factors.
According to communication theory, the process of artistic expression is called encoding, and the way the artwork is perceived by the audience is regarded as decoding [33][34]. Jakobson proposed six constitutive factors, each with its own function, in communication: the addresser, addressee, context, message, contact, and code [34]. For example, an artist (addresser) sends a message to an audience (addressee) through his or her painting. The work, as a message carrying a story (context), connects the artist and the audience (contact). Finally, the message must rest on a shared meaning system (code) by which the work is structured [20]. Three levels of problems, namely the technical, semantic, and effectiveness levels, have been identified in studies on the communication of paintings [31][35]. The technical level focuses on letting the addressee receive the message through visual attraction; the semantic level requires that the addressee understand the message's meaning without misinterpreting it; and the effectiveness level concerns the message's effect on the audience's feelings. During the creative process of AI art, artists choose AI algorithms according to their intentions for the artwork, and audience acceptance is a critical step in deciding whether it is "art" [36]. Studying the process of art perception can help build a bridge between artists and audiences [37][38].
4. Artworks Generated by Human–AI Co-Creation
Artworks are increasingly being created by machines through algorithms with little or no human input. At Christie's in 2018, the portrait "Edmond de Belamy," generated by a generative adversarial network (GAN), was auctioned for $432,500, indicating how rapidly AI has entered the field of art [39]. Recent works have addressed a variety of tasks, such as classification, object detection, similarity retrieval, multimodal representations, and computational aesthetics, among others [19]. Neural style transfer, the first AI technology to intervene in the field of art, has been widely used in platforms such as Prisma, Deep Dream Generator, and other art content production platforms. In 2022, text-to-image AI art generators became far more popular and have been applied to creating conceptual scenes, creative designs, and fictional illustrations. The processes of various forms of art creation are thus changing, and new occupations, such as the sale of prompts, have quickly emerged [40].
With the explosion of AI-related technologies and their continuous application in the field of art, a growing body of research initiatives and creative applications is arising at the intersection of AI and art. Artistic creation is embedded in cultural, historical, and institutional frameworks that directly interact with the artist's own creative process [21]. Lacking human consciousness, AI does not understand what it is doing; it is merely a suite of statistical models calculating favorable odds across enormous numbers of variations. In that light, AI cannot create art, but it can create patterns that an audience will likely perceive as art [41]. The human artist, as the author, is always the mastermind behind the work, and the computer is a tool [42]. However, AI technology is not like traditional tools: its randomness changes the way humans control it. Artists collaborate with AI agents, as sparks of inspiration, to augment the artistic process [41].
As for text-based generative art, it has also been argued that creativity lies not in the final artifact but in the interaction with the AI and the practices that arise from human–AI interaction [43]. It is not hard to imagine a future in which text prompts are themselves generated by language models, completely dehumanizing the creative process and severely distorting the human perception of the meaning behind an image [44]. Most studies report that AI-generated visual artworks can be recognized to some extent by humans, especially by experts in a specific art field [45][46], but other experimental results show that individuals are unable to accurately identify AI-generated artwork [32][47]. Previous research suggests that a deep learning model trained on large amounts of painting data can simulate human painting skills at the technical level; in contrast, people prefer paintings that connect at the semantic and emotional levels [31].