Cross-lingual document retrieval, which aims to use a query in one language to retrieve relevant documents in another, has attracted strong research interest over the past decades. Most studies on this task begin with cross-lingual comparison at the word level and then represent documents via word embeddings, which captures insufficient structural information.
1. Introduction
With the rapid growth of multilingual information on the Internet, cross-lingual document retrieval is becoming increasingly important for search engines. Monolingual information retrieval misses information in other languages, which can matter greatly: for example, users may want to find news about the same event in foreign languages. However, current search engines usually return documents written in the query's language, discarding many valuable results written in other languages. Information retrieval is difficult because queries and documents are likely to use different vocabularies, making correlations between them hard to find. This vocabulary mismatch is even more pronounced in cross-lingual document retrieval, so how to represent and compare documents across the language barrier has attracted considerable research attention.
To overcome the language barrier, many translation-based methods have achieved good results in cross-lingual retrieval tasks over the past decades [1]. These methods first translate queries or documents and then apply a monolingual retrieval method to rank the candidate documents. Their retrieval performance is therefore bound to the machine translation system and lacks flexibility. On the one hand, as machine translation improves with high-resource corpora, cross-lingual document retrieval improves with it. On the other hand, retrieval quality depends heavily on translation quality: translation errors or ambiguities in either the source or target language can severely degrade the results. Moreover, the volume of text to translate is typically huge, making translation expensive in both time and storage [1]. Large-scale translation in the Internet environment is therefore impractical, and for low-resource languages or domains that lack sufficient data to train a machine translator, a more lightweight document representation is urgently needed [2].
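The following minimal sketch makes the translate-then-retrieve pipeline concrete. The `translate` function is a placeholder for any machine translation system (an assumption, not a specific library call), and TF-IDF cosine similarity stands in for the monolingual ranker; it illustrates the pipeline's structure, not any particular system from the literature.

```python
# Minimal translate-then-retrieve sketch: the query is translated into
# the document language, then an ordinary monolingual TF-IDF ranker
# scores the candidate documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for any machine translation system (assumption)."""
    raise NotImplementedError("plug in an MT system here")

def retrieve(query: str, documents: list[str], src: str, tgt: str, k: int = 10):
    # 1) Cross the language barrier by translating the query.
    translated_query = translate(query, src, tgt)
    # 2) Rank documents with a standard monolingual retriever.
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([translated_query])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    ranking = scores.argsort()[::-1][:k]
    return [(int(i), float(scores[i])) for i in ranking]
```

The sketch also makes the cost argument visible: translating documents instead of queries would multiply the translation workload by the size of the collection.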
To obtain a more general cross-lingual document representation, many strategies have been proposed, such as knowledge-base approaches [3][4]. Representing documents with concept collections from a knowledge base avoids much computational overhead, but it loses most of the structural information of the documents themselves. This type of approach is limited by the conceptual coverage of the knowledge base; in particular, when low-resource languages are involved, the intersection of concepts covering all languages is much smaller. It is a heuristic method that does not fully consider document structure and cannot accurately capture the meaning of a document [2][4]. Moreover, it struggles with out-of-vocabulary words, and the document representation is not optimized through learning. Other studies combine speech features to improve the quality of multilingual document representations [5][6], representing documents with features from machine translation and automatic speech recognition (ASR). Speech features can enrich the semantics of documents and thus enhance the expressiveness of the representation, but these studies rely on speech corpora and on the quality of speech recognition features.
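A minimal sketch of the concept-based representation idea follows, assuming a per-language term-to-concept dictionary derived from a knowledge base (the tiny dictionaries and concept IDs below are illustrative, not from any real resource). Each document becomes a bag of language-independent concept IDs, which can be compared across languages directly.

```python
# Concept-based document representation sketch: documents are mapped to
# bags of language-independent concept IDs via per-language dictionaries.
from collections import Counter
import math

CONCEPTS = {
    "en": {"dog": "C_DOG", "house": "C_HOUSE"},
    "de": {"hund": "C_DOG", "haus": "C_HOUSE"},
}  # illustrative assumption; a real system uses a knowledge base

def concept_vector(text: str, lang: str) -> Counter:
    lexicon = CONCEPTS[lang]
    # Only terms covered by the knowledge base survive; word order,
    # structure, and out-of-vocabulary words are discarded, which is
    # precisely the weakness noted above.
    return Counter(lexicon[tok] for tok in text.lower().split() if tok in lexicon)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[c] * b[c] for c in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = concept_vector("dog house", "en")
doc = concept_vector("der hund im haus", "de")
print(cosine(query, doc))  # 1.0: both map onto the same two concepts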
Although most cross-lingual document representation methods rely on high-resource language data or parallel corpora, some studies have shown that comparable corpora are effective for cross-lingual document retrieval [2][7][8], which greatly alleviates the problem of resource scarcity. Most of these approaches first establish cross-lingual correspondence at the lexical level and then derive document embeddings, which is still a heuristic process.
2. Cross-Lingual Document Retrieval
With the popularity of pre-training and word embedding methods in natural language processing (NLP), many cross-lingual word embedding (CLWE) methods have been proposed in recent years and have achieved competitive cross-lingual retrieval performance [8][9][10]. Generally, CLWE methods require different supervision signals, including vocabulary alignment, sentence alignment, and document alignment [9][11]. In addition, many unsupervised cross-lingual word embedding methods have been studied [12][13][14]. These methods first obtain a cross-lingual vocabulary through a supervised signal or an unsupervised strategy, then represent documents by combining word embeddings, for example by averaging them, as sketched below [11][14]. The structure of the text is not well considered, and the embeddings are not explicitly optimized at the document level [2]. To improve the quality of cross-lingual word embeddings and reduce the level of supervision, many follow-up studies have focused on modeling the similarity between languages [11][15].
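The sketch below shows the word-embedding-combination step in its simplest form: assuming word vectors for both languages already live in one shared CLWE space (obtained by any method above), a document embedding is the normalized average of its word vectors, and retrieval ranks documents by cosine similarity.

```python
# Document representation by word-embedding combination: average the
# cross-lingual word vectors of a document, then rank by cosine.
import numpy as np

def embed_document(tokens: list[str], word_vectors: dict[str, np.ndarray],
                   dim: int) -> np.ndarray:
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    v = np.mean(vecs, axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def rank(query_tokens, docs_tokens, word_vectors, dim):
    q = embed_document(query_tokens, word_vectors, dim)
    # Dot product of unit vectors equals cosine similarity.
    scores = [float(q @ embed_document(d, word_vectors, dim)) for d in docs_tokens]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Nothing in this combination step is learned for the document level, which is exactly the limitation the text notes: word order and document structure vanish in the average.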
The spatial projection method, a weakly supervised approach, was proposed to optimize cross-lingual word embeddings [16]. It has been verified that a simple linear mapping can achieve good results, and many studies follow this strategy [12]. Supervised variants directly use an existing dictionary, while unsupervised variants automatically build a seed dictionary. A small seed dictionary is used to obtain vector spaces in which the paired words are aligned; a projection between the two spaces is then learned. This approach exploits the similarity between word embedding spaces to learn the mapping [17].
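A minimal sketch of learning such a projection from seed dictionary pairs follows. It shows two standard instantiations of the idea: an unconstrained least-squares map, and the orthogonal Procrustes solution via SVD (used by several follow-up works); the exact formulation in [16] may differ.

```python
# Learning a linear projection W between two monolingual embedding
# spaces so that X @ W ~= Y, where row i of X and Y holds the vectors
# of the i-th seed dictionary pair.
import numpy as np

def least_squares_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # Unconstrained solution minimizing ||X @ W - Y||^2.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def procrustes_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # Orthogonal solution: constraining W to a rotation preserves
    # distances within the source space. W = U @ Vt from SVD(X^T Y).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# After training, a source-language word vector x is compared against
# target-language vectors in the shared space via x @ W.
```

The orthogonality constraint is a common refinement precisely because it exploits the structural similarity between the two embedding spaces rather than distorting one to fit the other.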
Vulić and Moens obtained pseudo-bilingual documents by merging document-aligned corpora and trained cross-lingual word embeddings with the skip-gram model [8]. The work of Conneau et al. presents an unsupervised approach that achieves competitive results on word- and sentence-level retrieval problems, and the method also performs well on cross-lingual document retrieval tasks [12][17]. In short, most current methods still rely on parallel corpora, and it is still necessary to define document representations on top of word embeddings [18].
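The pseudo-bilingual idea can be sketched in a few lines: the words of each document-aligned pair are merged into a single document so that skip-gram sees mixed-language contexts and places both vocabularies in one space. Random shuffling is used below as one simple merging variant; the original work defines its own merging strategy, and the toy pairs are illustrative only.

```python
# Pseudo-bilingual training sketch: shuffle each aligned document pair
# into one merged document, then train standard skip-gram over it.
import random
from gensim.models import Word2Vec

def merge_pair(doc_src: list[str], doc_tgt: list[str], seed: int = 0) -> list[str]:
    merged = doc_src + doc_tgt
    random.Random(seed).shuffle(merged)
    return merged

aligned_pairs = [  # toy document-aligned corpus (illustrative)
    (["the", "dog", "barks"], ["der", "hund", "bellt"]),
    (["the", "house", "is", "old"], ["das", "haus", "ist", "alt"]),
]
corpus = [merge_pair(s, t, seed=i) for i, (s, t) in enumerate(aligned_pairs)]

# sg=1 selects the skip-gram objective over the pseudo-bilingual corpus.
model = Word2Vec(sentences=corpus, vector_size=50, window=5,
                 min_count=1, sg=1, epochs=50, seed=0)
print(model.wv.most_similar("dog", topn=3))
```

Because translation-equivalent words end up sharing contexts, their vectors are drawn together without any explicit dictionary.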
Most cross-lingual document embedding methods use alignment relationships to induce shared semantic spaces, which relies on high-quality parallel corpora. In general scenarios, comparable corpora with topic alignment are more readily available than parallel corpora. Approaches that require only document-aligned, comparable data are therefore promising, as they significantly alleviate the resource scarcity problem.
One line of work focuses on cross-lingual topic models, most of which are based on the latent Dirichlet allocation (LDA) algorithm [19][20]. Some approaches use word-aligned corpora, where the topic model is learned by optimizing the semantic distribution of words [21][22]; their disadvantage is that they are limited by multilingual vocabulary alignment resources [23]. Other studies focus on document-aligned corpora, exploiting large aligned collections to map multilingual documents onto shared topic distributions through training [24][25][26][27]. These methods focus on how the same concept is described in multiple languages, and they are concerned more with establishing connections between multilingual documents and concepts.
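Once such a model maps every document, regardless of language, to a distribution over the same K topics, retrieval reduces to comparing distributions. The sketch below assumes the topic distributions come from any of the cross-lingual topic models above and ranks documents by Jensen-Shannon divergence, one common choice for comparing topic mixtures.

```python
# Retrieval over shared topic distributions: rank documents by
# Jensen-Shannon divergence to the query's topic mixture
# (lower divergence = more similar).
import numpy as np

def jensen_shannon(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_by_topics(query_topics: np.ndarray, doc_topics: np.ndarray):
    # doc_topics: (n_docs, K) matrix, one topic distribution per document.
    divs = [jensen_shannon(query_topics, d) for d in doc_topics]
    return sorted(range(len(divs)), key=lambda i: divs[i])
```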
Beyond combining word embeddings, cross-lingual representation methods that operate directly at the document level have also been proposed and studied. Josifoski's work [2] obtains document representations by minimizing the gap between monolingual words and cross-lingual terminology, using topic tags directly as supervision signals to induce cross-lingual document embeddings. This is a sufficiently complex problem because the number of tags runs into the millions. Cr5 (cross-lingual reduced-rank ridge regression), a framework based on a linear algorithm, is proposed to factorize the classification weight matrix, which makes it highly efficient for massive tag sets. Experiments show that this linear model outperforms the baselines on document retrieval tasks; consequently, researchers use Cr5 as the main baseline. The Cr5 model can be seen as an enhanced cross-lingual word representation, since a single word can be treated as a document in this setting. However, because it uses the bag-of-words model, it considers word frequency but ignores the semantic positions of words, so it struggles to capture text structure. The researchers propose a method for cross-lingual embeddings that formulates the problem as multilabel classification and uses comparable corpora in an efficient and scalable manner.
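The reduced-rank idea behind Cr5 can be sketched as follows: fit a ridge-regularized linear map from bag-of-words features to the label space, then truncate its SVD to a small rank r, so that the rank-r factor embeds any document into a low-dimensional shared space. This is a simplified illustration of the reduced-rank ridge principle under those assumptions, not the exact Cr5 algorithm or its optimizations for millions of tags.

```python
# Simplified reduced-rank ridge regression sketch in the spirit of Cr5.
import numpy as np

def reduced_rank_ridge(X: np.ndarray, Y: np.ndarray, lam: float, r: int):
    # X: (n_docs, d) bag-of-words features; Y: (n_docs, n_labels) one-hot tags.
    d = X.shape[1]
    # Full ridge solution: W = (X^T X + lam I)^-1 X^T Y, shape (d, n_labels).
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # Reduced-rank factorization: W ~= A @ B with A: (d, r), B: (r, n_labels).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # maps documents into the r-dimensional space
    B = Vt[:r, :]          # label-side factor
    return A, B

def embed(X: np.ndarray, A: np.ndarray) -> np.ndarray:
    # Documents from any language, featurized over the shared feature
    # space, map to normalized r-dimensional embeddings for retrieval.
    Z = X @ A
    return Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
```

The factorization is what makes millions of tags tractable: documents never touch the full (d, n_labels) weight matrix at retrieval time, only the small rank-r factor.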