2.1. Detailed Literature Review
The present section provides a more detailed discussion of the different address matching algorithms based on the full text of the selected articles (also summarized in
Appendix A), with a view to extending the previously presented keyword-based analysis. This more detailed literature review is organized around the three categories of methods found to be the most relevant: string similarity-based methods, address element-based methods, and deep learning methods [
9].
String similarity-based methods constitute a standard approach for address matching and generally involve the computation of a similarity metric between the addresses under comparison. Three main types can be identified: character-based, vector-space-based, and hybrid approaches [
44]. Character-based methods rely on edit operations, such as subsequence comparisons, deletions, insertions, and substitutions. One of the best-known character-based methods is the Levenshtein edit distance metric [
45], defined as the minimum number of insertions, substitutions, or deletions required to convert one string into another (for instance, the edit distance between the toponyms Lisboa and Lisbonne is three, since the conversion requires two insertions and one substitution) [
44]. Another example of a character-based method is the Jaro metric [
46], specifically conceived for matching short strings, such as person names; a more advanced version, the Jaro–Winkler similarity, was later proposed [
47] in order to give higher scores to strings that match from the beginning up to a given prefix length. Regarding vector-space approaches, computing the cosine similarity between representations based on character n-grams (i.e., sequences of n consecutive characters) is a common option, alongside the Jaccard similarity coefficient [
44]. Lastly, hybrid metrics, while combining the advantages of the two previous approaches, also allow for small differences in word tokens and are more flexible with regard to word order and position [
44]. Nevertheless, no single technique performs best overall. The available metrics are task-dependent and, according to the study by Santos et al. [
44], which compared thirteen different string similarity metrics, the differences in performance are not significant, even when the metrics are combined with supervised methods to avoid manually tuning the decision threshold (one of the most important factors in obtaining good results).
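As an illustration of the character-based and vector-space metrics discussed above, the following sketch computes the Levenshtein edit distance and the Jaccard coefficient over character bigrams; the function names and the choice of n = 2 are illustrative and not tied to any of the cited implementations.

```python
# Illustrative sketch of two string similarity metrics for address/toponym comparison.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution (free if equal)
        prev = curr
    return prev[-1]

def jaccard_ngrams(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the sets of character n-grams of two strings."""
    def grams(s: str) -> set:
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(levenshtein("Lisboa", "Lisbonne"))              # 3, as in the example above
print(round(jaccard_ngrams("Lisboa", "Lisbonne"), 2))  # bigram-based similarity
```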
Address element-based methods, in turn, rely on address parsing, a sequence tagging task which has traditionally been approached using probabilistic methods mainly based on Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) [
3], alongside other less common approaches not always involving machine learning methods.
When HMMs are applied to residential addresses, the hidden states correspond to the segments of the address and the observations consist of the tokens assigned to each word of the input address string (after some cleaning procedures), an assignment which may be based on look-up tables and hard-coded rules [
6]. For instance, the address “17 Epping St Smithfield New South Wales 2987” would, after cleaning and tokenization, be converted into a sequence of observation tags, one per token.
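Since the original tag assignment table is not reproduced here, the sketch below illustrates the general look-up-table tokenization step with hypothetical tag codes (NUM, WORD, STREET_TYPE, STATE, POSTCODE); neither the tags nor the tables correspond to the exact vocabulary used by Churches et al. [6].

```python
# Hedged sketch of look-up-table tokenization; tags and tables are illustrative.
import re

STREET_TYPES = {"st", "street", "rd", "road", "ave", "avenue"}
STATES = {"new_south_wales", "nsw", "victoria", "vic", "queensland", "qld"}

def tokenize(address: str) -> list[tuple[str, str]]:
    """Clean an address string and assign one observation tag per token."""
    cleaned = re.sub(r"[^\w\s]", " ", address.lower())
    cleaned = cleaned.replace("new south wales", "new_south_wales")  # multi-word state
    tags = []
    for tok in cleaned.split():
        if tok.isdigit() and len(tok) == 4:
            tags.append((tok, "POSTCODE"))
        elif tok.isdigit():
            tags.append((tok, "NUM"))
        elif tok in STREET_TYPES:
            tags.append((tok, "STREET_TYPE"))
        elif tok in STATES:
            tags.append((tok, "STATE"))
        else:
            tags.append((tok, "WORD"))
    return tags

print(tokenize("17 Epping St Smithfield New South Wales 2987"))
# [('17', 'NUM'), ('epping', 'WORD'), ('st', 'STREET_TYPE'),
#  ('smithfield', 'WORD'), ('new_south_wales', 'STATE'), ('2987', 'POSTCODE')]
```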
One of the main drawbacks of traditional HMMs is that they do not support multiple simultaneous observations for one token. Even in more advanced versions of HMMs, such as maximum entropy Markov models (MEMMs) [
49], in which the current state depends both on the previous state and on existing observations, there is a weakness known as the label bias problem [
50], which causes “transitions leaving a given state to compete only against each other, rather than against all transitions in the model” [
41] (p. 2). Within the present literature review, four of the considered articles propose HMM-based methods: the already mentioned one by Churches et al. [
6], aiming at the preparation of name and address data for record linkage purposes, through a combined approach using lexicon-based tokenization and HMMs, with the obtained experimental results confirming it as an alternative to rule-based systems that is both feasible and cost-effective; a second paper by the same authors [
51], in which a geocoding system based on HMMs and a rule-based matching engine (
Febrl) for spatial data analysis is proposed and tested on small datasets of randomly selected addresses from different sources, with experimental results pointing to exact match rates between 89% and 94%, depending on the source and considering the total exact matches obtained at various levels (address level, street level, and locality level); the paper by X. Li et al. [
40], in which an HMM-based large-scale address parser is proposed, obtaining an F-measure of 95.6% after being tested on data from various sources with varying degrees of quality and containing billions of records, of which 20% were synthetically rearranged in order to reproduce common address variations; and, finally, the paper by Fu et al. [
52], in which an HMM-based segmentation and recognition algorithm is proposed for the development of automatic mail-sorting systems involving handwritten Chinese characters (a problem which will be further addressed in the present literature review), with experimental results confirming its effectiveness.
Conditional Random Fields (CRFs) are a more recent innovation in the field of text segmentation. CRFs are conditional by nature and do not assume independence between output labels, which mirrors real-world addresses, in which zip codes, for instance, are related to city names, localities, and even streets [
3]. While retaining all the advantages of MEMMs, CRFs also solve the label bias problem by letting the probability of a transition between labels depend on past and future elements and not only on the current address element [
3]. “The essential difference between CRFs and MEMMs is that the underlying graphical model structure of CRFs is undirected, while that of MEMMs is directed” [
41] (p. 2). Considering, as an example, the address “3B Records, 5 Slater Street, Liverpool L1 4BW”, an HMM parser would erroneously predict the first and second labels as standing for a number (‘3B’) and a street (‘Records’), respectively, whereas a CRF parser, upon reaching the actual property number (5), would assign scores so as to revise the current label to a property number and the previous labels (‘3B Records’) to a business name [3].
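As an illustration of how such a parser can be trained, the sketch below fits a small CRF tagger with the sklearn-crfsuite library; the feature set, label names, and the single toy training pair are illustrative assumptions rather than the configurations used in [3] or [41].

```python
# Minimal CRF address-parsing sketch using sklearn-crfsuite (illustrative only).
import sklearn_crfsuite

def token_features(tokens, i):
    """Features for token i, including its left and right context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "has_digit": any(c.isdigit() for c in tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

train_tokens = [["3B", "Records", "5", "Slater", "Street", "Liverpool", "L1", "4BW"]]
train_labels = [["business", "business", "number", "street", "street",
                 "city", "postcode", "postcode"]]

X = [[token_features(seq, i) for i in range(len(seq))] for seq in train_tokens]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50,
                           all_possible_transitions=True)
crf.fit(X, train_labels)

test = ["7", "Slater", "Street", "Liverpool", "L1", "4BW"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```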
Another recent approach to address parsing is based on so-called “word embeddings”, the name given to vector representations of words [
3]. An implementation of this approach is word2vec [
53], an unsupervised neural network language model which learns word representations by modeling the relationships between a given word and the words in its context, based on two possible architectures: the continuous skip-gram model (Skip-Gram) and the continuous bag-of-words model (CBOW) [
53]. The latter is usually chosen over the former, since it is trained by inferring the meaning of a particular word from its context [
9].
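A brief sketch of training such embeddings on tokenized addresses with the gensim implementation of word2vec is shown below; the toy corpus and the hyper-parameters are illustrative assumptions.

```python
# Training word2vec address embeddings with gensim (CBOW vs Skip-Gram).
from gensim.models import Word2Vec

corpus = [
    ["17", "epping", "st", "smithfield", "new_south_wales", "2987"],
    ["5", "slater", "street", "liverpool", "l1", "4bw"],
]

# sg=0 selects CBOW (predict a word from its context); sg=1 selects Skip-Gram.
model = Word2Vec(sentences=corpus, vector_size=100, window=3, min_count=1, sg=0)

street_vector = model.wv["street"]                 # 100-dimensional embedding
print(model.wv.most_similar("street", topn=3))     # nearest tokens in the toy corpus
```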
A practical comparison between HMMs, CRFs, and a CRF augmented with word2vec is undertaken by Comber and Arribas-Bel [
3]. The VGI-based
Libpostal library (
https://github.com/openvenues/libpostal (accessed on 9 November 2021)), which trains a CRF model on 1 billion street addresses from OSM data, was used for the segmentation task. Although the obtained results are broadly consistent in terms of precision, the classifiers using the HMM technique present lower recall values than the ones obtained by the CRF, meaning that both methods are capable of distinguishing true positives from false positives, but the CRF is able to classify a greater proportion of matches [
3]. The augmented version of the CRF model does not outperform the original one but presents the advantage of not committing the user to a particular string distance metric and its biases [
3]. In another recent work by the same author [
54], a predictive model for address matching is proposed, based on recent innovations in machine learning and on a CRF parser for the segmentation of address strings. The biggest contribution of that paper, however, is the thorough documentation of all the steps required to execute the proposed model’s workflow. In other papers included in the present literature review, CRFs are used as a benchmark model, e.g., by Dani et al. [
55], or in combination with other methods, which will be further addressed [
56,
57,
58].
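For reference, the Libpostal parser mentioned above can be called from Python through the pypostal bindings, as in the hedged sketch below; the example address is illustrative.

```python
# Parsing an address with the Python bindings to Libpostal's pretrained CRF model.
from postal.parser import parse_address

print(parse_address("3B Records, 5 Slater Street, Liverpool L1 4BW"))
# Returns (value, label) pairs with labels such as house, house_number, road,
# city and postcode.
```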
Other, less recent approaches have been proposed for address parsing/segmentation, namely within address standardization studies aiming to minimize the size of the labelled training data. One such example is the work by Kothari et al. [
59], in which a nonparametric Bayesian approach to clustering grouped data, known as hierarchical Dirichlet process (HDP) [
60], is used with a view to discovering latent concepts that represent common semantic elements across different sources and allowing the automatic transfer of supervision from a labeled source to an unlabeled one. The obtained latent clusters are used to segment and label address records in an adapted CRF classifier, with experimental results pointing to a considerable improvement in classification accuracy [
59]. A similar approach is proposed by Guo et al. [
61], whose paper presents a supervised address standardization method based on latent semantic association (LaSA), with a view to capturing latent semantic associations among words in the same domain. The obtained experimental results show that the performance of standardization is significantly improved by the proposed method. Expert systems have also been proposed, namely by Dani et al. [
55], in which a Ripple Down Rules (RDR) framework is proposed with a view to enabling a cost-effective migration of data cleansing algorithms between different datasets. RDR allows rules to be modified incrementally and exceptions to be added without unwanted side effects, based on a failure-driven approach in which a rule is only added when the existing system fails to classify an instance [
55]. After comparison with traditional machine learning algorithms and a commercial system, experimental results show that the RDR approach requires significantly fewer rules and training examples to reach the same accuracy as those methods [
55].
Tree-based models have been proposed to handle automatic handwritten address recognition, a particular address parsing/segmentation task mostly studied by Chinese researchers due to the greater complexity of the Chinese language (larger character set, different writing styles, and great similarity between many characters) [
62]. In the paper by Jiang et al. [
62], a suffix tree is proposed to store and access addresses from any character. Compared with previous approaches also based on a tree data structure, the proposed suffix tree is able to deal with noise and address format variations. Basically, a hierarchical substring list is first built, after which the obtained input radicals are compared with candidate addresses (filtered by postcode) with a view to optimizing a cost function combining both recognition and matching accuracy [
62]. A correct classification rate of 85.3% is obtained in the experimental results. However, according to Wei et al. [
22], the recognition accuracy of character-level tree (CLT) models depends on the completeness of the address list on which they are based. In order to overcome this limitation, the authors propose a structure tree built at the word level (WLT), in which each node consists of an address word and the path from the root to a leaf corresponds to a standardized address format. After initial recognition by a character classifier, segment candidate patterns are mapped to candidate address words based on the WLT database. In the final phase (path matching), the candidate address words’ scores are summed in order to obtain the address recognition result [
22]. The obtained experimental results show that the proposed method outperforms four benchmark methods, including the previously mentioned suffix tree. Address tree models for address parsing and standardization are also proposed in the papers by Tian et al. [
63], Liu et al. [
64] and Li et al. [
65]. In the first two, the address tree model is mainly used for rule-based validation and error recognition, by providing information about the hierarchy of Chinese addresses and, in the case of the paper by Li et al. [
65], latent tree structures are designed with a view to capturing the rich dependencies between the final segments of an address, which do not always follow the same order.
Within the address element-based methods, it is also worth highlighting geocoding as a means to enhance address standardization through the correction of misspellings and the filling in of missing attributes, two of the most common errors found in postal addresses [
66]. After successful matching with a record from a standardized reference database (like Google Maps or OSM), reverse geocoding can be performed to obtain a valid and complete representation of the queried address. In the case of geocoded databases like GNAF, for instance [
51], geographic coordinates can be used to calculate the spatial proximity between different records, for conducting distance-based spatial analyses and for record linkage between different databases (down to the house number level). Another important application of address geocoding relates to the matching of historical address records (such as census records) with contemporary data, by attaching grid references to the former in order to perform longitudinal spatial demographic analyses [
27]. However, the successful automated geocoding of residential addresses depends on a number of factors, namely population densities (with positional error increasing as population density decreases) [
27,
67], the completeness of an address (existence or not of a number and street name), and changes in street names, among others [
27]. These limitations can be tackled by the prior standardization and enrichment of addresses [
68] and the choice of the most adequate geocoding method, including the use of property data [
67] or the use of hybrid geocoding approaches [
28].
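As an illustration of the geocoding and reverse-geocoding round trip described above, the sketch below uses the geopy bindings to the OSM Nominatim service; the user_agent string and the sample address are illustrative assumptions.

```python
# Forward geocoding against a reference database, then reverse geocoding to
# recover a complete, standardized address representation (illustrative sketch).
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="address-matching-review-example")

# Forward geocoding: match the (possibly incomplete) query to a reference record.
location = geolocator.geocode("5 Slater Street, Liverpool")
if location is not None:
    # Reverse geocoding: obtain a valid, complete representation of the address.
    standardized = geolocator.reverse((location.latitude, location.longitude))
    print(standardized.address)
```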
With the advancement of deep learning methods, various authors have recently proposed the adoption of the previously mentioned extensions to RNNs (namely, LSTMs and GRUs), as well as GCNs, in order to better cope with nonstandard address records and highly dissimilar toponyms. LSTMs and GRUs are both composed of gates, i.e., small neural networks that regulate the flow of information from one time step to the next, thereby helping to solve the short-term memory problem. In particular, GRUs have two gates (update and reset), while LSTMs have three (input, forget, and output) [
30]. In LSTMs, the amount of new information added through the input gate is independent of the amount of information retained through the forget gate, whereas in GRUs the retention of past memory and the addition of new information to memory are not independent [
30].
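For reference, the gating mechanism of a GRU can be written, in one common formulation (the notation below is the standard one and is not taken from [30]), as:

\[
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \quad \text{(update gate)}\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \quad \text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right)\\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]

Because the single gate \(z_t\) weights both the retained memory and the newly added candidate state \(\tilde{h}_t\), the two quantities are coupled, in contrast with the separate input and forget gates of an LSTM.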
Within the present literature review, several of the considered papers propose these types of methods, namely the ones by Santos et al. [
35], Lin et al. [
9], J. Liu et al. [
58], Shan et al. [
7,
29], P. Li et al. [
69], and Chen et al. [
70]. To take into account contextual information from both previous and future tokens, by processing the sequence in two directions, bidirectional LSTM (BiLSTM) or GRU layers are employed in the great majority of these studies. The best-performing models further connect the encoder and decoder through an attention mechanism in order to assign higher weights to the most important features [
7,
9,
29]. With a view to reducing overfitting and enhancing the classification models’ generalization abilities, a dropout regularization layer is also normally added [
9,
35,
58]. The ESIM model [
71] is an illustrative example of a deep learning architecture based on the principles previously described. After address tokenization (with the help of gazetteers and dictionaries, in the case of more complex languages with no natural word separators) and the computation of vector representations of the different (labelled) address pairs (based on word2vec), the ESIM model is employed through the following four layers [
9]:
-
An input encoding layer, which encodes the input address vectors and extracts higher-level representations using the bidirectional long short-term memory (BiLSTM) model;
-
A local inference modelling layer, which makes local inferences on an address pair using a modified decomposable attention model [
72];
-
An inference composition layer, responsible for making a global inference between two compared address records based on their local inference, in which average and max pooling are used to summarize the local inference and output a final vector with a fixed length;
-
Finally, a prediction layer, based on a multilayer perceptron (MLP) composed of three fully connected layers with rectified linear unit (ReLU), tanh, and softmax activation functions, is used to output the predictive results for address pairs (that is, whether there is a match or not).
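A minimal PyTorch sketch of an ESIM-style matcher following these four layers is given below; the dimensions, the class name (AddressESIM), and the absence of dropout and masking are simplifying assumptions, not the implementation evaluated in [9] or [71].

```python
# ESIM-style address pair matcher: encoding, local inference (soft alignment),
# inference composition, and an MLP prediction layer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddressESIM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # 1) Input encoding layer (BiLSTM over each address)
        self.enc = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        # 3) Inference composition layer (second BiLSTM)
        self.comp = nn.LSTM(8 * hidden, hidden, bidirectional=True, batch_first=True)
        # 4) Prediction layer: MLP with ReLU/tanh; softmax applied to the logits
        self.mlp = nn.Sequential(
            nn.Linear(8 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2))

    def forward(self, a, b):                      # a, b: (batch, len) token ids
        ea, _ = self.enc(self.emb(a))             # (batch, la, 2h)
        eb, _ = self.enc(self.emb(b))             # (batch, lb, 2h)
        # 2) Local inference via decomposable (soft-alignment) attention
        att = torch.bmm(ea, eb.transpose(1, 2))   # (batch, la, lb)
        a_hat = torch.bmm(F.softmax(att, dim=2), eb)
        b_hat = torch.bmm(F.softmax(att, dim=1).transpose(1, 2), ea)
        ma = torch.cat([ea, a_hat, ea - a_hat, ea * a_hat], dim=2)
        mb = torch.cat([eb, b_hat, eb - b_hat, eb * b_hat], dim=2)
        va, _ = self.comp(ma)
        vb, _ = self.comp(mb)
        # Average and max pooling to a fixed-length vector
        v = torch.cat([va.mean(1), va.max(1).values,
                       vb.mean(1), vb.max(1).values], dim=1)
        return self.mlp(v)                        # logits for {non-match, match}
```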
In terms of performance, all of the previously presented deep learning methods achieve a greater matching accuracy than the traditional text-matching models. In the case of the BiLSTM model proposed by Lin et al. [
9], the precision, recall, and F1 score on the test set all reached 0.97, against the 0.92 scores achieved by the second-best performing model (Jaccard similarity coefficient + RF method). The GRU-based deep neural network proposed by Santos et al. [
35] to categorize toponym pairs as matches or non-matches also outperforms traditional text-matching methods, achieving an increase of almost 10 points in most of the evaluation metrics (namely, accuracy, precision, and F1). The LM-LSTM-CRF+BP neural network model proposed by J. Liu et al. [
58] achieves an accuracy and F1 score of 87%, compared with average scores of 70% by the benchmark methods (word2vec and edit distance). The address GCN method proposed by Shan et al. [
7] also presents better results, on both precision (up to 8%) and recall (up to 12%), than the existing methods, which include the DeepAM model previously proposed by the same author, based on an encoder-decoder architecture with two LSTM networks [
29]. The Bi-GRU neural network proposed by P. Li et al. [
69] presents a performance similar to that of a Bi-LSTM neural network (F1 score of 99%) and a higher performance than unidirectional GRU and LSTM neural networks (F1 score of 93%), as would be expected. Finally, the attention-Bi-LSTM-CNN network (ABLC) proposed by Chen et al. [
70] achieves an accuracy improvement of 4–10% over the baseline models, which include the previously mentioned ESIM model, the second-best overall performer.
In two of the most recent studies included in the present literature review, also based on deep learning methods, bidirectional encoder representations from Transformers (BERT) are proposed instead. The first is the study by Xu et al. [
20], which proposes a method for fusing addresses and geospatial data based on BERT (for learning address semantics) and a K-Means high-dimensional clustering algorithm, enhanced by innovative fine-tuning techniques, to create a geospatial-semantic address model (GSAM). The computational representation extracted from GSAM is then employed to predict address locations, using a neural network architecture that minimizes the Euclidean distance between the predicted and real coordinates. In the second study [
37], a new address element recognition method is proposed for dealing with address elements with short change periods (streets, lanes, landmarks, points-of-interest names, etc.) that have not yet been included in a segmentation dictionary. A BERT-based model is first applied to obtain vector representations of the dataset, learn the contextual information, and model address features, followed by the use of a CRF to predict the tags, with new address elements being recognized according to their tags [
37].
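As an illustration of the first step shared by both studies, the sketch below extracts contextual token representations for an address string with a pretrained BERT model through the Hugging Face transformers library; the checkpoint name and the example address are illustrative assumptions, not the exact setup of [20] or [37].

```python
# Obtaining contextual BERT token representations for an address string, as a
# front end to a downstream tagger such as a CRF (illustrative sketch).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

address = "浙江省杭州市西湖区文三路138号"
inputs = tokenizer(address, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)token; a CRF layer would predict one
# address-element tag per token from these representations.
token_embeddings = outputs.last_hidden_state  # shape: (1, seq_len, 768)
```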
In terms of performance, the GSAM model [
20] achieves a classification accuracy above 0.97, against a minimum expected accuracy of 0.91 by other methods; the BERT-CRF model [
37] achieves the highest F1 score in terms of generalization ability (0.78), when compared to benchmark models combining word2vec, BiLSTM, and CRF methods (with an average F1 score of 0.41), as well as a high F1 score on the testing dataset (0.95).
Although related to POIs’ locations and descriptions, two final articles (both published in 2021) are worth mentioning, due to their combined use of the previously presented approaches and spatial correlation/reasoning methods. The first of these studies [
2] presents a method for identifying POIs in large POI datasets in a quick and accurate manner, based on: an enhanced address-matching algorithm combining string, semantic, and spatial similarity within an ontology model describing POIs’ locations and relationships, in order to support the transition from the semantic to the spatial domain; and a grid-based algorithm capable of achieving compact representations of the vast number of qualitative direction relations between POIs and of performing quick spatial reasoning, through the fast retrieval of direction relations and quantitative calculations. The second of the studies [
8] proposes an unsupervised method to segment and standardize POIs’ addresses, based on a GRU neural network combined with the spatial correlation between address elements for the automatic segmentation task, and a tree-based fuzzy matching of address elements for the standardization task, with experimental results pointing to a relatively high accuracy.