Counting Bloom filter: History Edit

Counting Bloom filter is an efficient data structure to test quickly whether the count number of given element in a given sequence is smaller than a given threshold number or not. Counting Bloom filter occupies relatively small memory size compared to the sequence, nevertheless if Counting Bloom filter gives answer that 'the count number of element is smaller than the threshold', the answer is always right. However, Counting Bloom filter may say 'the count number is not smaller than the threshold' even if it is smaller in reality. Fortunately, the probability of wrong answer can be minimized by adjusting Counting Bloom filter architecture. In this entry, mathematical analysis about minimizing the wrong answer is presented with the detailed explanation about Counting Bloom filter.

  • Counting Bloom Filter
  • false positive
  • analysis

Let's assume a circumstance that there is a sequence E with size n, and there are many queries to figure out whether the count number of a given element appearing in the sequence is smaller than a given threshold θ or not. To manage the queries efficiently, constructing a database and handling the queries with SQL would be preferred. At first, an empty table would be created through SQL command such as

CREATE TABLE table(Id int PRIMARY KEY, Element NOT NULL, Count int).

Next, to accumulate the count numbers, for all elements e in E, insertion SQL command would be operated such as

INSERT OR REPLACE INTO table VALUES ( COALESCE((SELECT e FROM table WHERE Element=e), (SELECT MAX(Id) FROM Data)+1, 1), e, COALESCE( (SELECT Count FROM table WHERE Element=e), 0) + 1).

Finally, for each query with an element e, the exact count number can be presented with proper SQL select command such as

SELECT Count FROM table WHERE Element = e.

Using the exact number, the answer of the query would be received.

The computational time for handling a query is at least , which can be nearly  when the duplication among elements is small. However, even if the duplication is small, the query can be handled in smaller time using Counting Bloom Filter. Counting Bloom filter (CBF) data structure was proposed first by Fan in 2000[1]. CBF consists of m counters and k hash functions,  where . Usually, m is relatively smaller than n. Each hash function maps an element in the set to a CBF counters. At first, all counters are initialized to 0. Next, for each element e in E,  where . Finally, for an element e, if  where , the count number of e is definitely smaller than the threshold θ. According to statistics term, CBF always gives true negative. On the other hand, if there is a \(h_i(e)\) such that , CBF says e is not smaller than θ, however the answer can be wrong. With statistics term, this is called false positive. CBF returns the answer in O(k) time, which is smaller than .

The bigger m is, the smaller false positive is. An extreme case, if m=MAX(Id) and k=1, false positive becomes 0. However, to satisfy the system requirement, m is fixed with smaller than n. Therefore, the problem is finding the optimal k value,  for the lowest false positive when m and n is fixed. According to previous researches about CBF such as [2], the false positive with threshold θ is

,

where b means binomial distribution, . In the specific form of CBF with θ=1,  is known as .

Previous researches concludes that the optimal k value for general CBF is same as above k value. However, this k value minimizes , not . Therefore, there needs the  for general θ. This paper shows how to find the optimal k value. This entry briefly reviews the paper. 

A sure method for finding  is taking derivative of . However, Binomial distribution is quite complicated to find derivative. So approximating binomial distribution can be helpful. Fortunately, CBF is used when n and m are very large values. This induces binomial distribution can be approximated to poisson distribution . By the way, cumulative distribution function of poisson distribution can be expressed using incomplete gamma function. Therefore,  becomes,

where the incomplete gamma function is defined as  for the real positive value κ. From some manipulation of above equation, it can be shown that  decreases from k=1 to k=, and increases from k= to k=, which means there is only one  which minimize . The closed form of  is not shown easily, but  can be also shown. From the false positive shape and interval of , can be found by numerical analysis. This paper shows detailed analysis and actual numerical values. According to the result in this paper,  or  is fit for use in  values for θ<30 are also provided, satisfying .

 

References

  1. Li Fan; Pei Cao; J. Almeida; A.Z. Broder; Summary cache: a scalable wide-area Web cache sharing protocol. IEEE/ACM Transactions on Networking 2000, 8, 281-293, 10.1109/90.851975.
  2. Sasu Tarkoma; Christian Esteve Rothenberg; Eemil Lagerspetz; Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials 2012, 14, 131-155, 10.1109/SURV.2011.031611.00024.
  3. This paper shows how to find the optimal k value. . MDPI. Retrieved 2019-7-16
More