Assessment of Parent–Child Interaction Quality from Dyadic Dialogue: History

The quality of parent–child interaction is critical for child cognitive development. The Dyadic Parent–Child Interaction Coding System (DPICS) is commonly used to assess parent and child behaviors. However, manual annotation of DPICS codes by parent–child interaction therapists is a time-consuming task. To assist therapists in the coding task, researchers have begun to explore the use of artificial intelligence in natural language processing to classify DPICS codes automatically.

  • parent–child interaction
  • DPICS
  • text classification
  • natural language processing

1. Introduction

Although the quality of parent–child interaction (PCI) profoundly impacts a child’s cognitive and socio-emotional development, achieving high-quality interaction can be challenging for many families [1,2]. Parent–child interaction therapy (PCIT) is a therapeutic approach designed to help parents of children with early behavior problems improve their relationship with their child and manage their child’s behavior effectively [3]. PCIT is linked to favorable outcomes for both children and families, reducing child behavior problems and alleviating family stress [4,5]. The Dyadic Parent–Child Interaction Coding System (DPICS) was developed in tandem with PCIT to monitor treatment progress. DPICS quantifies child and parent behaviors in dyadic interaction and has been extensively employed to assess parent–child interaction quality and treatment outcomes. DPICS is typically coded manually by a trained therapist or research staff [6]. This can be problematic, as training coders to fidelity is costly, and when large amounts of data are collected, the time spent coding can significantly delay the research process.
Artificial intelligence is an emerging trend propelled by the swift advancement of machine learning and deep learning technologies. Its goal is to create intelligent agents capable of completing tasks in a manner similar to humans. State-of-the-art results and superhuman achievements have been attained in many fields, including AlphaGo in the game of Go, Boston Dynamics’ Atlas in whole-body robotics, and the recent conversational dialogue agent ChatGPT. Within the realm of natural language processing, pre-trained deep learning language models like BERT and GPT have gained growing popularity [7,8]. Giving computers the ability to understand human language has long been a goal of natural language processing, and pre-trained models are fed massive collections of raw documents in the hope of identifying relationships among words and sentences.
Labeling DPICS codes is a laborious and time-consuming task for both experts and therapists. To assist PCIT therapists, Huber et al. introduced the SpecialTime system, designed to offer parents feedback as they engage in at-home practice of PCIT skills [9]. The SpecialTime system can automatically classify child-directed dialogue acts into the eight DPICS classes.

2. Assessment of Parent–Child Interaction Quality from Dyadic Dialogue

2.1. Text Feature Extraction

2.1.1. Text Representation

When working with text in machine learning models, we need to convert the text into numerical vectors so that the models can process it. Two common methods for achieving this are one-hot encoding and integer encoding. One-hot encoding generates a vector whose length matches the vocabulary size and places a “1” in the index that corresponds to the word. This approach is inefficient because most values in the resulting vector are zero. In contrast, integer encoding assigns a unique integer value to each word. While this approach creates a dense vector that can be more efficient for machine learning models, it does not capture any relationships between the words, meaning that there is no inherent similarity between the encoded values of two words. For example, the integer values assigned to “he” and “she” have no relationship to each other, despite their semantic similarity. This limitation can pose challenges for specific natural language processing tasks, especially those that require a nuanced understanding of relationships between words.
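As a minimal sketch of these two encodings, assuming a toy five-word vocabulary (the words and sentence below are hypothetical), both representations can be built by hand:

```python
import numpy as np

# Hypothetical toy vocabulary
vocab = ["he", "she", "went", "to", "school"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def integer_encode(sentence):
    """Map each word to its unique integer index (dense but relationship-free)."""
    return [word_to_index[w] for w in sentence.split()]

def one_hot_encode(sentence):
    """Build a (sentence_length x vocab_size) matrix with a single 1 per row."""
    indices = integer_encode(sentence)
    matrix = np.zeros((len(indices), len(vocab)), dtype=int)
    matrix[np.arange(len(indices)), indices] = 1
    return matrix

print(integer_encode("she went to school"))  # one integer per word
print(one_hot_encode("she went to school"))  # mostly zeros, one 1 per row
```

Note that the integer codes for “he” and “she” differ by an arbitrary amount, which is exactly the lack of semantic structure described above.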
Apart from one-hot encoding and unique numbers, previous techniques such as Bag of Words (BoW) and Term Frequency–Inverse Document Frequency (TF-IDF) have been used for converting the text to numerical vectors [10,11]. BoW and TF-IDF are both statistical measurement methods. There are also several variants, such as n-gram models and smoothed variants of TF-IDF.

Bag of Words

The Bag of Words (BoW) technique is extensively employed as a text representation method in NLP. It involves converting a piece of text into a collection of individual words or terms, along with their respective frequencies [10]. To create a BoW model, the text is preprocessed to remove stopwords and punctuation. Each word in the preprocessed text is then tokenized and counted, resulting in a dictionary of unique words and their respective frequencies. Ultimately, the text is represented as a vector whose length corresponds to the size of the dictionary. Despite its widespread use, BoW has several limitations. First, BoW disregards the order and context of words in the text, potentially losing crucial information about the meaning and context of individual words. Second, the vocabulary size can be very large, resulting in a high-dimensional vector space that is computationally expensive and memory-intensive. Third, stopwords, common words like “the” and “a”, can dominate the frequency counts and mislead the model if they are not removed. Finally, most documents contain only a small subset of the words in the vocabulary, resulting in sparse vectors that can make it difficult to compare documents or compute similarity measures.
Despite these weaknesses, BoW remains a widespread and effective technique for tasks such as text classification and sentiment analysis, especially when combined with other techniques like feature selection and dimensionality reduction [12,13,14,15].
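As a brief illustration of the counting step described above, the following sketch uses scikit-learn’s CountVectorizer on a few hypothetical utterances (the sentences and the stopword setting are illustrative assumptions, not the configuration of any study discussed here):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical utterances, not drawn from a real transcript
docs = [
    "you built a tall tower",
    "put the red block on the tower",
    "that is a very tall tower",
]

# stop_words="english" removes common words such as "the" and "a"
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one count vector per document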

Term Frequency–Inverse Document Frequency

Term Frequency–Inverse Document Frequency (TF-IDF) is a statistical measure employed to determine the relevance of words in a text document or corpus. TF-IDF frequently serves as a weighting factor in information retrieval searches, text mining, and user modeling [16].
TF-IDF is composed of two metrics: term frequency (TF) and inverse document frequency (IDF). The TF score measures how often words appear in a particular document. In simple words, TF counts the occurrences of words in a document. The weight of a term is directly proportional to its frequency in the document. This implies that words appearing more frequently in a document are assigned a higher weight [11]. In contrast, IDF measures the rarity of words in the text, assigning more importance to infrequently used words in the corpus that may carry significant information. By integrating IDF, TF-IDF reduces the significance of frequently occurring terms while amplifying the importance of less common terms [17].
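In one common (unsmoothed) formulation, the TF-IDF weight of a term $t$ in a document $d$ drawn from a corpus $D$ of $N$ documents is

$$\operatorname{tfidf}(t, d, D) = \operatorname{tf}(t, d) \times \log \frac{N}{\lvert \{ d' \in D : t \in d' \} \rvert},$$

where $\operatorname{tf}(t, d)$ is the number of occurrences of $t$ in $d$ and the denominator counts the documents containing $t$; smoothed variants add constants to the numerator and denominator to avoid division by zero.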
TF–IDF has been one of the most widely used methods in NLP and machine learning for tasks like document classification, text summarization, sentiment classification, and spam message detection. For example, it can identify the most relevant words in a document and then apply these words as features in a classification model. A survey conducted in 2015 on text-based recommender systems found that 83% of them used TF-IDF [18]. Furthermore, many previous studies have demonstrated the effectiveness of TF-IDF for tasks like automated text classification and sentiment analysis [19,20,21,22,23]. However, TF-IDF has limitations. TF-IDF does not efficiently capture the semantic meaning of words in a sequence or consider the order in which terms appear. Additionally, TF-IDF can be biased towards longer documents, meaning that longer documents will generally have higher scores than shorter ones.

2.1.2. Word Embedding

Word embeddings are a form of representation learning employed in NLP that helps computers comprehend the relationships between words. Humans have always excelled at understanding relationships between words such as man and woman or cat and dog. Word embeddings were developed to represent these relationships as numeric vectors in an n-dimensional space. In this context, words with similar meanings share comparable representations, meaning that two related words are depicted by similar vectors positioned close together in the vector space. This technique has been used effectively in various NLP tasks, such as sentiment analysis and machine translation. However, creating effective word embeddings is a central problem in NLP, because the quality of the embeddings directly affects the performance of downstream tasks. Moreover, compact word representations in a lower-dimensional space can make models faster to train, making the creation of effective word embeddings a critical research area.

Word2Vec

Word2Vec is a popular technique for learning word embeddings using shallow neural networks, developed by Mikolov et al. [24]. Word2Vec comprises two distinct models: Continuous Bag of Words (CBOW) and Continuous Skip-gram. The CBOW model predicts the middle word based on surrounding context words, while Skip-gram predicts the surrounding context words given a target word. In CBOW, the context comprises a few words before and after the middle word [25].
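As an illustrative sketch, assuming the gensim library and a tiny hypothetical tokenized corpus (far too small for meaningful embeddings), the two variants can be trained by switching the sg flag:

```python
from gensim.models import Word2Vec

# Hypothetical pre-tokenized corpus; real training needs far more text
sentences = [
    ["you", "built", "a", "tall", "tower"],
    ["put", "the", "block", "on", "the", "tower"],
    ["great", "job", "building", "the", "tower"],
]

# sg=0 selects CBOW (predict the middle word from its context);
# sg=1 selects Skip-gram (predict the context from the middle word)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["tower"][:5])               # first 5 dimensions of one embedding
print(skipgram.wv.most_similar("tower"))  # nearest words in the learned space
```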

Global Vectors for Word Representation

Global Vectors for Word Representation (GloVe) is an algorithm that generates word embeddings by using matrix factorization techniques on a word-context matrix. To create the word-context matrix, a large corpus is scanned for each term, and context terms within a window defined by a window size before and after the term are counted. The resulting matrix contains co-occurrence information for each word (the rows) and its context words (the columns). To account for the decreasing importance of words as their distance from the target word increases, a weighting function is used to assign lower weights to more distant words [26,27].
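The weighted word–context counting step described above can be sketched as follows (the token list and window size are hypothetical, and the subsequent matrix factorization performed by GloVe is not shown):

```python
from collections import defaultdict

# Hypothetical tokenized corpus and window size
tokens = ["you", "built", "a", "tall", "tower", "with", "the", "red", "blocks"]
window_size = 2

# co_occurrence[(word, context)] accumulates distance-weighted counts:
# context words further from the target contribute less (weight 1/distance)
co_occurrence = defaultdict(float)
for i, word in enumerate(tokens):
    start = max(0, i - window_size)
    end = min(len(tokens), i + window_size + 1)
    for j in range(start, end):
        if j != i:
            distance = abs(j - i)
            co_occurrence[(word, tokens[j])] += 1.0 / distance

for (word, context), weight in sorted(co_occurrence.items())[:5]:
    print(word, context, weight)
```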

2.1.3. Transformer

In a study by Vaswani et al. (2017), an attention-based architecture called the Transformer was introduced [28]. Transformers are a type of sequence transduction model that relies solely on attention rather than recurrence. This approach allows the model to capture more global relationships in longer input and output sequences. As a result, Transformers have recently been utilized in natural language processing to address various challenges.
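The core operation is scaled dot-product attention which, for query, key, and value matrices $Q$, $K$, and $V$ with key dimension $d_k$, is computed as [28]

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V.$$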

Bidirectional Encoder Representations from Transformers

BERT is a self-supervised model for learning language representations that was released by Google AI in 2018 [8]. BERT introduces a masked bidirectional language modeling objective that leverages context from both directions to predict randomly masked tokens, allowing it to better capture contextualized word associations. BERT belongs to the class of models known as Transformers and comes in two variants: BERT-Base, which incorporates 110 million parameters, and BERT-Large, which boasts 340 million parameters. BERT relies on an attention mechanism to generate high-quality, contextualized word embeddings [28]. The attention mechanism captures word associations based on the words to the left and right of each word as it passes through each BERT layer during training. Compared to traditional techniques like BoW and TF-IDF, BERT is a revolutionary technique for creating better word embeddings, thanks to its pretraining on massive corpora such as English Wikipedia and BooksCorpus. BERT has been successfully applied to many NLP tasks, including language translation [29,30,31].
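As a brief sketch, assuming the Hugging Face transformers library and the publicly released bert-base-uncased checkpoint (the input sentence is hypothetical), contextualized token embeddings can be extracted as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained BERT-Base model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical utterance; each token receives a context-dependent vector
inputs = tokenizer("You built a tall tower!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```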

DistilBERT

DistilBERT is a highly efficient and cost-effective variant of the BERT model that was developed by distilling BERT-base. With 40% fewer parameters than bert-base-uncased, DistilBERT is both small and lightweight. Additionally, it runs 60% faster than BERT while maintaining an impressive 97% performance on the GLUE language understanding benchmark [32].

RoBERTa

Yinhan Liu et al. proposed a robust approach called the Robustly Optimized BERT-Pretraining Approach (RoBERTa) in 2019, which aims to improve upon the original BERT model for pretraining natural language processing (NLP) systems [33]. RoBERTa shares the same architecture as BERT, but incorporates modifications to the key hyperparameters and minor embedding tweaks to increase robustness. Unlike BERT, RoBERTa does not use the next-sentence pretraining objective, and instead trains the model with much larger mini-batches and learning rates. Additionally, RoBERTa is trained using full sentences, dynamic masking, and a larger byte-level byte-pair encoding (BPE) technique. RoBERTa has been widely adopted in downstream NLP tasks and has achieved outstanding results compared to other models [34,35,36].

2.2. Text Classification

Text classification is also referred to as text tagging or text categorization. The aim is to categorize and classify text into organized groups. Text classifiers can automatically analyze provided text and assign a set of pre-defined tags or categories based on its content.
While human experts are still considered the most reliable method for text classification, manual classification can be a complex, tedious, and costly task. With the advancement of NLP, text classification has become increasingly important, particularly in areas such as sentiment analysis, topic detection, and language detection. Various machine learning and deep learning methods have been employed for sentiment analysis, with Twitter being a popular data source [37,38,39,40]. Supervised methods, including decision trees, random forests, logistic regression, support vector machines (SVMs), and naive Bayes, have been used to train classifiers [41,42]. However, supervised approaches require labeled data, which can be expensive. To address this, unsupervised learning methods, such as that proposed by Pandarachalil et al., have been suggested [43].
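As a hedged illustration of the supervised approaches mentioned above, the following sketch trains several of the listed classifiers on the same bag-of-words features; the sentences and labels are hypothetical toy data, not from any study cited here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled sentences (positive/negative sentiment)
texts = ["great job", "well done", "I love this",
         "this is bad", "awful result", "I hate this"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Train naive Bayes, logistic regression, and a linear SVM on identical features
for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["what a great result"]))
```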

2.3. Dyadic Parent–Child Interaction Coding System and Parent–Child Interaction Therapy

The Dyadic Parent–Child Interaction Coding System, fourth edition (DPICS-IV), is a structured behavioral observation tool that assesses essential parent and child behaviors in standardized situations. DPICS-IV has proven to be a valuable adjunct to PCIT and has also been used extensively to evaluate other parenting interventions and research objectives [6]. Over the years, DPICS has been utilized in various studies addressing a wide range of clinical and research questions. Nelson et al. highlight the development of DPICS and discuss its current usage as a treatment process or outcome variable. The authors also summarize the ways in which DPICS has been adapted and describe the process by which it is intended to be adapted [45].
The DPICS-IV scoring system is based on the frequency counts of ten main categories: Neutral Talk, Labeled Praise, Unlabeled Praise, Behavior Description, Reflection, Information Question, Descriptive Question, Direct Commands, Indirect Commands, and Negative Talk. However, in previous work, eight categories were commonly used, where Information Question and Descriptive Question were combined into a single category called Question, and Indirect Commands and Direct Commands were combined as Commands [9,46,47]. Both Cañas et al. and Huber et al. have suggested that not all DPICS codes are equally important for therapy outcomes and have placed more emphasis on Negative Talk. In addition, Cañas et al. found that the DPICS Negative Talk factor demonstrated a high discriminant capacity (AUC = 0.90) between samples, and a cut-off score of 8 allowed the classification of mother–child dyads with 82% sensitivity and 89% specificity [46].
The process of labeling DPICS codes manually for each sentence in a conversation is a time-consuming and labor-intensive task that requires trained experts. Confirmatory factor analysis is then used to verify the factor structure of the observed variables [48]. However, Huber et al. have developed SpecialTime, an automated system that can classify transcript segments into one of eight DPICS classes. The system uses a linear support vector machine trained on text feature representations obtained using TF-IDF and part-of-speech tags. The system achieves an overall accuracy of 78%, as evaluated by the authors using an expert-labeled corpus [9].
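A minimal sketch in the spirit of that pipeline is shown below; it pairs TF-IDF features with a linear support vector machine but omits the part-of-speech features, and the utterances and simplified DPICS-style labels are hypothetical, so it should not be read as the authors’ implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical parent utterances with simplified DPICS-style labels;
# the real SpecialTime system also used part-of-speech features [9]
utterances = [
    "you built a tall tower",         # Behavior Description
    "great job stacking the blocks",  # Labeled Praise
    "put the red block here",         # Command
    "what color is this block",       # Question
]
codes = ["BD", "LP", "CM", "QU"]

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(utterances, codes)
print(classifier.predict(["stack the blue block on top"]))  # predicted code
```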
PCIT helps parents improve interaction quality with children with behavior problems. The therapy instructs parents to employ effective dialogue during interactions with their children [49].

This entry is adapted from the peer-reviewed paper 10.3390/app132011129
