Cross-Lingual Document Retrieval

Version	Summary	Created by	Modification	Content Size	Created at
1		kai feng	--	1283	2023-02-24 08:06:25
2	update references and layout	Rita Xu	Meta information modification	1283	2023-02-24 08:51:25

This entry is adapted from the peer-reviewed paper 10.3390/e24070943

Cross-lingual document retrieval, which takes a query in one language and retrieves relevant documents in another, has attracted strong research interest in recent decades. Most studies on this task start with cross-lingual comparisons at the word level and then represent documents by combining word embeddings, which captures little of the documents' structural information.

Keywords: cross-lingual document retrieval; cross-lingual features

1. Introduction

With the rapid growth of multilingual information on the Internet, cross-lingual document retrieval is becoming increasingly important for search engines. Monolingual information retrieval misses information in other languages, which can matter a great deal: for example, users may want to find foreign-language news about the same event. However, current search engines usually return documents written in the language of the query, discarding many valuable results written in other languages. Information retrieval is difficult even within one language because queries and documents are likely to use different vocabularies for related content. This mismatch is more pronounced in cross-lingual document retrieval, so how to represent and compare documents across language barriers has attracted a great deal of research.

To tackle the language barrier, many translation-based methods have achieved good results in cross-lingual retrieval tasks in the past decades ^[1]. These methods first translate queries or documents and then use a monolingual retrieval method to rank the candidate documents. Their retrieval performance is tied to the machine translation method, and they lack flexibility. On the one hand, as machine translation improves with high-resource corpora, the performance of cross-lingual document retrieval improves as well. On the other hand, the retrieval result depends heavily on translation quality: any translation error or ambiguity in the source or target language can severely degrade the retrieval results. Moreover, the volume of text to translate is typically huge, and the cost in time and storage is high ^[1]. Large-scale translation in the Internet environment is therefore impractical. In addition, some low-resource languages or domains do not have enough data to train a machine translation system, so a more lightweight document representation is urgently needed ^[2].
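The translate-then-retrieve pipeline described above can be illustrated with a minimal sketch. It assumes a toy bilingual dictionary in place of a real machine translation system and a plain term-frequency score in place of a real monolingual ranker; the names `TOY_DICTIONARY`, `translate_query`, and `rank_documents` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy dictionary (German -> English); a real system would use
# a machine translation model here.
TOY_DICTIONARY = {
    "erdbeben": ["earthquake"],
    "bank": ["bank", "bench"],  # ambiguous source word: both senses are kept
    "stadt": ["city"],
}


def translate_query(query_terms, dictionary):
    """Replace each query term by all of its dictionary translations."""
    translated = []
    for term in query_terms:
        # Terms missing from the dictionary are passed through untranslated.
        translated.extend(dictionary.get(term, [term]))
    return translated


def rank_documents(query_terms, documents):
    """Rank documents by the summed term frequency of the query terms."""
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        scored.append((score, doc_id))
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


documents = {
    "d1": "earthquake damages city centre",
    "d2": "new bench installed in the park",
    "d3": "bank raises interest rates",
}
query = translate_query(["erdbeben", "stadt"], TOY_DICTIONARY)
ranking = rank_documents(query, documents)
```

The ambiguous entry for "bank" shows how translation ambiguity propagates: a query containing it would match both the bench document and the finance document, whichever sense the user intended.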

To obtain a more general cross-lingual document representation, many strategies have been proposed, such as knowledge-base approaches ^[3]^[4]. Representing documents with concept collections from a knowledge base avoids much computational overhead, but it loses most of the structural information of the documents themselves. This type of approach is also limited by the conceptual scope of the knowledge base: especially when low-resource languages are included, the set of concepts shared by all languages is much smaller. It is a heuristic method that does not fully consider document structure and cannot accurately cover the meaning of a document ^[2]^[4]. Moreover, it handles out-of-vocabulary words poorly, and the document representation is not optimized via learning. Other studies combine speech features to improve the quality of multilingual document representations ^[5]^[6], representing documents with features from machine translation and automatic speech recognition (ASR). Speech features can enrich the semantics of documents and thus enhance the expressiveness of the document representation. However, these studies rely on speech corpora and on the quality of the speech recognition features.
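As a rough illustration of the knowledge-base idea, the sketch below maps words to language-independent concept identifiers and compares two documents by the overlap of their concept sets. The concept inventory, the identifiers, and the use of Jaccard overlap are assumptions made for this example; they are not the scoring functions of the cited systems.

```python
# Hypothetical concept inventory: surface forms per language -> concept id.
CONCEPT_INDEX = {
    "en": {"earthquake": "C1", "city": "C2", "river": "C3"},
    "es": {"terremoto": "C1", "ciudad": "C2"},  # "río" is not covered
}


def to_concepts(tokens, language):
    """Map tokens to the set of knowledge-base concepts they mention."""
    index = CONCEPT_INDEX[language]
    # Words outside the inventory are silently dropped.
    return {index[token] for token in tokens if token in index}


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


english_doc = to_concepts("earthquake near the river city".split(), "en")
spanish_doc = to_concepts("terremoto cerca del río de la ciudad".split(), "es")
similarity = jaccard(english_doc, spanish_doc)
```

Because "río" has no entry on the Spanish side, the two documents do not match perfectly even though they describe the same content, which mirrors the coverage and out-of-vocabulary limitations noted above. Word order plays no role in the representation.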

Although most cross-lingual document representation methods rely on high-resource language data or parallel corpora, some studies have shown that cross-lingual document retrieval can be addressed effectively with comparable corpora ^[2]^[7]^[8], which greatly alleviates the problem of resource scarcity. Most of these approaches first achieve cross-lingual alignment at the lexical level and then derive document embeddings, which is still a heuristic process.

2. Cross-Lingual Document Retrieval

With the popularity of pre-training and word embedding methods in natural language processing (NLP), many cross-lingual word embedding (CLWE) methods have been proposed, and they have achieved competitive cross-lingual retrieval performance in recent years ^[8]^[9]^[10]. Cross-lingual word embedding methods generally require supervision signals at different levels, including vocabulary alignment, sentence alignment, and document alignment ^[9]^[11]. Many unsupervised cross-lingual word embedding methods are also being studied ^[12]^[13]^[14]. These methods first obtain a cross-lingual vocabulary through a supervised signal or an unsupervised strategy and then represent documents by combining word embeddings in similar ways ^[11]^[14]. As a result, the structural information in texts is not well captured, and the embeddings are not explicitly optimized at the document level ^[2]. To improve the quality of cross-lingual word embeddings and reduce the level of supervision, many follow-up studies have focused on representing the similarity between languages ^[11]^[15].
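The step of representing documents by combining word embeddings can be sketched as follows. This is a minimal example that assumes the embeddings already lie in one shared cross-lingual space and that combination means unweighted averaging; the three-dimensional vectors in `SHARED_EMBEDDINGS` are invented for illustration.

```python
import math

# Hypothetical embeddings, assumed to be aligned in one shared space already.
SHARED_EMBEDDINGS = {
    "earthquake": [0.9, 0.1, 0.0],
    "terremoto": [0.8, 0.2, 0.0],
    "city": [0.1, 0.9, 0.0],
    "ciudad": [0.2, 0.8, 0.1],
    "football": [0.0, 0.1, 0.9],
    "fútbol": [0.1, 0.0, 0.9],
}


def embed_document(tokens, embeddings):
    """Average the embeddings of all known tokens (unknown tokens are skipped)."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_tokens, documents, embeddings):
    """Return document ids sorted by cosine similarity to the query."""
    query_vec = embed_document(query_tokens, embeddings)
    scores = {}
    for doc_id, tokens in documents.items():
        doc_vec = embed_document(tokens, embeddings)
        if query_vec is None or doc_vec is None:
            scores[doc_id] = 0.0
        else:
            scores[doc_id] = cosine(query_vec, doc_vec)
    return sorted(scores, key=scores.get, reverse=True)


documents = {"es1": ["terremoto", "ciudad"], "es2": ["fútbol"]}
ranking = retrieve(["earthquake", "city"], documents, SHARED_EMBEDDINGS)
```

Averaging is order-insensitive: `embed_document(["city", "earthquake"], ...)` yields the same vector as `embed_document(["earthquake", "city"], ...)`, which is one concrete sense in which such representations disregard text structure.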

The spatial projection method, a weakly supervised approach, was proposed to optimize cross-lingual word embeddings ^[16]. This simple linear mapping has been shown to achieve good results, and many studies follow this strategy ^[12]. The supervised variant directly uses an existing dictionary, while the unsupervised variant builds the seed dictionary automatically. A small seed dictionary provides word pairs that are aligned across the two vector spaces, and a projection that converts one space into the other is then learned from these pairs. This approach focuses on exploiting the similarity between word embedding spaces to learn the mapping ^[17].
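The linear projection described above can be written compactly. The notation is an assumption introduced here: the seed dictionary gives n pairs (x_i, y_i) of source and target word vectors in R^d, and X and Y are the d × n matrices whose columns are these vectors. The first line is the least-squares mapping; the second is the commonly used variant that restricts W to be orthogonal, for which a closed-form solution via singular value decomposition is known.

```latex
% Least-squares mapping learned from the seed dictionary
\[
W^{*} = \arg\min_{W \in \mathbb{R}^{d \times d}}
        \sum_{i=1}^{n} \lVert W x_i - y_i \rVert_2^{2}
\]
% Orthogonal variant (Procrustes problem) and its closed-form solution
\[
W^{*} = \arg\min_{W^{\top} W = I} \lVert W X - Y \rVert_F^{2} = U V^{\top},
\qquad
U \Sigma V^{\top} = \operatorname{SVD}\!\left( Y X^{\top} \right)
\]
```

Once W is learned, any source-language word vector x, including words outside the seed dictionary, is mapped to Wx and compared with target-language vectors, typically by cosine similarity.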

Vulić and Moens obtained pseudo-bilingual documents by merging document-aligned corpora and trained cross-lingual word embeddings on them with the skip-gram model ^[8]. Conneau et al. present an unsupervised approach that achieves competitive results on word- and sentence-level retrieval problems, and this method also performs well on cross-lingual document retrieval tasks ^[12]^[17]. In short, most current methods still rely on parallel corpora, and document representations still have to be defined on top of word embeddings ^[18].
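To make the notion of a pseudo-bilingual document concrete, the sketch below merges the tokens of two aligned documents and lists the (center, context) pairs a skip-gram model would see. The deterministic interleaving used here is an assumption chosen for reproducibility (a random shuffle of the merged tokens is another option), and the language prefixes and function names are hypothetical.

```python
from itertools import zip_longest


def merge_aligned_pair(source_tokens, target_tokens):
    """Interleave two aligned documents into one pseudo-bilingual document."""
    merged = []
    for src, tgt in zip_longest(source_tokens, target_tokens):
        # zip_longest pads the shorter document with None; skip the padding.
        if src is not None:
            merged.append(src)
        if tgt is not None:
            merged.append(tgt)
    return merged


def skipgram_pairs(tokens, window=2):
    """Enumerate the (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


english = ["en:earthquake", "en:hits", "en:city"]
dutch = ["nl:aardbeving", "nl:treft", "nl:stad", "nl:vandaag"]
pseudo_doc = merge_aligned_pair(english, dutch)
pairs = skipgram_pairs(pseudo_doc, window=1)
```

After merging, words from both languages appear in each other's context windows, so a skip-gram model trained on such documents is pushed to place them in a single embedding space.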

Most cross-lingual document embedding methods use alignment relationships to induce a shared semantic space, which relies on a high-quality parallel corpus. In general scenarios, comparable corpora with topic alignment are more readily available than parallel corpora. Approaches that require only document-aligned comparable data are therefore promising, as they significantly alleviate the resource scarcity problem.

One line of work focuses on cross-lingual topic models, most of which are based on the latent Dirichlet allocation (LDA) algorithm ^[19]^[20]. Some approaches use word-aligned corpora, where the topic model is obtained by optimizing the semantic distribution of words ^[21]^[22]. Their disadvantage is that they are limited by multilingual vocabulary alignment resources ^[23]. Other studies focus on document-aligned corpora; they make effective use of large aligned corpora and map multilingual documents to corresponding topic distributions through training ^[24]^[25]^[26]^[27]. The focus of these methods is on how to describe the same concept in multiple languages, while the approach is concerned more with establishing connections between multilingual documents and concepts.
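Once documents in different languages are mapped to distributions over a shared set of topics, they can be compared directly. The sketch below assumes such distributions are already available (the numbers are invented) and ranks candidates by Jensen-Shannon divergence; the choice of divergence is an assumption for illustration, not the scoring function of the cited models.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and finite for any two distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical topic distributions over three shared topics.
query_topics = [0.7, 0.2, 0.1]
candidates = {
    "fr1": [0.6, 0.3, 0.1],
    "fr2": [0.1, 0.1, 0.8],
}

# Smaller divergence means a closer topical match.
ranking = sorted(candidates, key=lambda d: js_divergence(query_topics, candidates[d]))
```

The comparison involves no vocabulary at all: only the topic proportions matter, which is why the approach depends on how well the shared topics are aligned across languages.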

Instead of combining word embeddings to obtain document representations, cross-lingual representation methods at the document level have also been proposed and studied. Josifoski et al. ^[2] propose to obtain document representations by minimizing the gap between monolingual words and cross-lingual terminology. Topic tags are used directly as supervision signals to induce cross-lingual document embeddings. This is a challenging problem because the number of tags runs into the millions. Cr5 (cross-lingual reduced-rank ridge regression), a framework based on a linear algorithm, is proposed to split the classification weight matrix into factors, which makes it highly efficient for the massive number of tags. Experiments show that this linear model achieves better performance than the baselines in document retrieval tasks. Consequently, researchers use Cr5 as the main baseline. The Cr5 model can be seen as an enhanced cross-lingual word representation, since a single word can be treated as a document at this stage. However, because it uses the bag-of-words model, it considers the frequency of word occurrence but ignores the position of each word, and it is difficult for it to account for text structure. Researchers propose a method for cross-lingual embeddings that frames the problem as multilabel classification and uses comparable corpora in an efficient and scalable manner.
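A generic form of reduced-rank ridge regression helps to see what splitting the weight matrix means. The notation below is an assumption made for this sketch and is not necessarily that of the Cr5 paper: X is the N × V bag-of-words matrix of the training documents, Y is the N × C tag indicator matrix, λ is the ridge penalty, and r is the embedding dimension with r much smaller than C.

```latex
% Ridge regression with a rank constraint on the weight matrix
\[
\min_{W \in \mathbb{R}^{V \times C},\ \operatorname{rank}(W) \le r}
\lVert X W - Y \rVert_F^{2} + \lambda \lVert W \rVert_F^{2},
\qquad
W = \Phi \Psi^{\top},\quad
\Phi \in \mathbb{R}^{V \times r},\quad
\Psi \in \mathbb{R}^{C \times r}
\]
% Embedding of a document with bag-of-words vector x
\[
\operatorname{embed}(x) = \Phi^{\top} x \in \mathbb{R}^{r}
\]
```

Under this reading, the factor Φ carries the embeddings. A document consisting of a single word has a one-hot vector x, so its embedding is the corresponding row of Φ, which is why the model can also be viewed as a word representation. Since x records only word counts, two documents with the same words in a different order receive the same embedding.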

References

[1] Nie, J.Y. Cross-Language Information Retrieval. Synth. Lect. Hum. Lang. Technol. 2010, 3, 1–125.
[2] Josifoski, M.; Paskov, I.S.; Paskov, H.S.; Jaggi, M.; West, R. Crosslingual Document Embedding as Reduced-Rank Ridge Regression. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, Australia, 11–15 February 2019; pp. 744–752.
[3] Potthast, M.; Stein, B.; Anderka, M. A Wikipedia-Based Multilingual Retrieval Model. In Proceedings of the 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, 30 March–3 April 2008.
[4] Franco-Salvador, M.; Rosso, P.; Navigli, R. A Knowledge-based Representation for Cross-Language Document Retrieval and Categorization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, Gothenburg, Sweden, 26–30 April 2014.
[5] Siniscalchi, S.M.; Reed, J.; Svendsen, T.; Lee, C.H. Exploiting context-dependency and acoustic resolution of universal speech attribute models in spoken language recognition. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010.
[6] Yarmohammadi, M.; Ma, X.; Hisamoto, S.; Rahman, M.; Wang, Y.; Xu, H.; Povey, D.; Koehn, P.; Duh, K. Robust Document Representations for Cross-Lingual Information Retrieval in Low-Resource Settings. In Proceedings of the Machine Translation Summit XVII Volume 1: Research Track, MTSummit 2019, Dublin, Ireland, 19–23 August 2019; pp. 12–20.
[7] Vulić, I.; De Smet, W.; Moens, M.F. Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora. Inf. Retr. 2013, 16, 331–368.
[8] Vulić, I.; Moens, M.F. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 363–372.
[9] Ruder, S.; Vulić, I.; Søgaard, A. A Survey of Cross-lingual Word Embedding Models. J. Artif. Intell. Res. 2019, 65, 569–631.
[10] Bonab, H.; Sarwar, S.M.; Allan, J. Training effective neural CLIR by bridging the translation gap. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, China, 25–30 July 2020; pp. 9–18.
[11] Glavaš, G.; Litschko, R.; Ruder, S.; Vulić, I. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 710–721.
[12] Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; Jégou, H. Word translation without parallel data. arXiv 2018, arXiv:1710.04087.
[13] Wada, T.; Iwata, T.; Matsumoto, Y. Unsupervised multilingual word embedding with limited resources using neural language models. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019; pp. 3113–3124.
[14] Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451.
[15] Smith, S.L.; Turban, D.H.; Hamblin, S.; Hammerla, N.Y. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv 2017, arXiv:1702.03859.
[16] Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting Similarities among Languages for Machine Translation. arXiv 2013, arXiv:1309.4168.
[17] Litschko, R.; Glavaš, G.; Ponzetto, S.P.; Vulić, I. Unsupervised cross-lingual information retrieval using monolingual data only. arXiv 2018, arXiv:1805.00879.
[18] Zhang, M.; Liu, Y.; Luan, H.; Sun, M. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 1959–1970.
[19] Blei, D.M.; Ng, A.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
[20] Chan, C.-H.; Zeng, J.; Wessler, H.; Jungblut, M.; Welbers, K.; Bajjalieh, J.; van Atteveldt, W.; Althaus, S.L. Reproducible Extraction of Cross-lingual Topics (rectr). Commun. Methods Meas. 2020, 14, 285–305.
[21] Zhang, D.; Mei, Q.; Zhai, C. Cross-Lingual Latent Topic Extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010.
[22] Hao, S.; Paul, M.J. Learning Multilingual Topics from Incomparable Corpora. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, NM, USA, 20–26 August 2018.
[23] Piccardi, T.; West, R. Crosslingual Topic Modeling with WikiPDA. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 3032–3041.
[24] Mimno, D.; Wallach, H.M.; Naradowsky, J.; Smith, D.A.; McCallum, A. Polylingual Topic Models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Singapore, 6–7 August 2009.
[25] Ni, X.; Sun, J.T.; Hu, J.; Chen, Z. Mining multilingual topics from Wikipedia. In Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, 20–24 April 2009.
[26] Fukumasu, K.; Eguchi, K.; Xing, E.P. Symmetric Correspondence Topic Models for Multilingual Text Analysis. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012.
[27] Zhang, T.; Liu, K.; Zhao, J. Cross Lingual Entity Linking with Bilingual Topic Model. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Information

Subjects: Computer Science, Artificial Intelligence

Contributors: Wei Wei

Update Date: 24 Feb 2023
