1000/1000
Hot
Most Recent
In information theory, MostFreqKDistance is a string metric technique for quickly estimating how similar two ordered sets or strings are. The scheme was invented by Sadi Evren Seker (2014), and initially used in text mining applications like author recognition. The method is originally based on a hashing function MaxFreqKChars classical author recognition problem and idea first came out while studying on data stream mining. The algorithm is suitable for coding in any turing complete programming language.
Method has two steps.
The first step of algorithm is calculating the hashing based on the most frequent k characters. The hashing algorithm has below steps:
String function MostFreqKHashing (String inputString, int K) def string outputString for each distinct character count occurrence of each character for i := 0 to K char c = next most freq ith character (if two chars have same frequency then get the first occurrence in inputString) int count = number of occurrence of the character append to outputString, c and count end for return outputString
Above function, simply gets an input string and an integer K value and outputs the most frequent K characters from the input string. The only condition during the creation of output string is adding the first occurring character first, if the frequencies of two characters are equal. Similar to the most of hashing functions, Most Frequent K Hashing is also a one way function.
The second step of algorithm works on two outputs from two different input strings and outputs the similarity coefficient (or distance metric).
int function MostFreqKSimilarity (String inputStr1, String inputStr2, int limit) def int similarity for each c = next character from inputStr1 lookup c in inputStr2 if c is null continue similarity += frequency of c in inputStr1 return limit - similarity
Above function, simply gets two input strings, previously outputted from the MostFreqKHashing
function. From the most frequent k hashing function, the characters and their frequencies are returned. So, the similarity function calculates the similarity based on characters and their frequencies by checking if the same character appears on both strings. The limit is usually taken to be 10 and in the end the function returns the result of the subtraction of the sum of similarities from limit.
In some implementations, the distance metric is required instead of similarity coefficient. In order to convert the output of above similarity coefficient to distance metric, the output can be subtracted from any constant value (like the maximum possible output value). For the case, it is also possible to implement a wrapper function over above two functions.
In order to calculate the distance between two strings, the below function can be implemented
int function MostFreqKSDF (String inputStr1, String inputStr2, int K, int maxDistance) return MostFreqKSimilarity(MostFreqKHashing(inputStr1, K), MostFreqKHashing(inputStr2, K), maxDistance)
Any call to above string distance function will supply two input strings and a maximum distance value. The function will calculate the similarity and subtract that value from the maximum possible distance. It can be considered as a simple additive inverse of similarity.
Let's consider maximum 2 frequent hashing over two strings ‘research’ and ‘seeking’. MostFreqKHashing('research', 2) = r2e2 because we have 2 'r' and 2 'e' characters with the highest frequency and we return in the order they appear in the string. MostFreqKHashing('seeking', 2) = e2s1 Again we have character 'e' with highest frequency and rest of the characters have same frequency of 1, so we return the first character of equal frequencies, which is 's'. Finally we make the comparison: MostFreqKSimilarity('r2e2', 'e2s1') = 2 We simply compared the outputs and only character occurring in both input is character 'e' and the occurrence in both input is 2. Instead running the sample step by step as above, we can simply run by using the string distance wrapper function as below: MostFreqKSDF('research', 'seeking', 2) = 2
Below table holds some sample runs between example inputs for K=2:
Inputs | Hash outputs | SDF output (max from 10) |
---|---|---|
'night' 'nacht' |
n1i1 n1a1 |
9 |
'my' 'a' |
m1y1 a1NULL0 |
10 |
‘research’ ‘research’ |
r2e2 r2e2 |
6 |
‘aaaaabbbb’ ‘ababababa’ |
a5b4 a5b4 |
1 |
‘significant’ ‘capabilities’ |
i3n2 i3a2 |
7 |
Method is also suitable for bioinformatics to compare the genetic strings like in FASTA format.
The motivation behind algorithm is calculating the similarity between two input strings. So, the hashing function should be able to reduce the size of input and at the same time keep the characteristics of the input. Other hashing algorithms like MD5 or SHA-1, the output is completely unrelated with the input and those hashing algorithms are not suitable for string similarity check.
On the other hand, string similarity functions like Levenshtein distance have the algorithm complexity problem.
Also algorithms like Hamming distance, Jaccard coefficient or Tanimoto coefficient have relatively low algorithm complexity but the success rate in text mining studies are also low.
The calculation of time complexity of 'most frequent k char string similarity' is quite simple. In order to get the maximum frequent K characters from a string, the first step is sorting the string in a lexiconical manner. After this sort, the input with highest occurrence can be achieved with a simple pass in linear time complexity. Since major classical sorting algorithms are working in O(nlogn) complexity like merge sort or quick sort, we can sort the first string in O(nlogn) and second string on O(mlogm) times. The total complexity would be O(nlog n ) + O (m log m) which is O(n log n) as the upper bound worst case analysis.
Below table compares the complexity of algorithms:
Algorithm | Time complexity |
---|---|
Levenshtein distance | [math]\displaystyle{ O(nm) = O(n^2) }[/math] |
Jaccard index | [math]\displaystyle{ O(n+m) = O(n) }[/math] |
MostFreqKSDF | [math]\displaystyle{ O(n \log n + m \log m) = O(n \log n) }[/math] |
For the above table, n is the length of first string and m is the length of second string.
The success of string similarity algorithms are compared on a study. The study is based on IMDB62 dataset which is holding 1000 comment entries in Internet Movie Database from each 62 people. The data set is challenged for three string similarity functions and the success rates are as below:
Algorithm | Running time | Error (RMSE) | Error (RAE) |
---|---|---|---|
Levenshtein distance | 3647286.54 sec | 29 | 0.47 |
Jaccard index | 228647.22 sec | 45 | 0.68 |
MostFreqKSDF | 2712323.51 sec | 32 | 0.49 |
The running times for author recognition are in seconds and the error rates are root mean square error (RMSE) and relative absolute error (RAE).
Above table shows, the 'most frequent k similarity' is better than Levenshtein distance by time and Jaccard index by success rate.
For the time performance and the success rates, the bitwise similarity functions like Sørensen–Dice index, Tversky index or Hamming distance are all in the same category with similar success rates and running times. There are obviously slight differences but the idea behind bitwise operation, loses the string operations like deletion or addition. For example, a single bit addition to the front of one of the input strings would yield a catastrophic result on the similarity for bitwise operators while Levenshtein distance is successfully catching.
Unfortunately, big data studies often require a faster algorithm that still has an acceptable success rate. In such situations, the 'max frequent k characters' is a conceptually simpler algorithm that is also straight forward to implement.