Quickstart

Install

The easiest method of installing pyprobables is by using the pip package manager:

Pip Installation:

$ pip install pyprobables

API Documentation

The full API documentation for the pyprobables package: pyprobables API

Example Usage

Bloom Filters

Bloom Filters provide set membership operations over large datasets while keeping a small memory footprint. They provide a zero percent false negative rate and a predetermined, or desired, false positive rate. more information

Import, Initialize, and Train

>>> from probables import (BloomFilter)
>>> blm = BloomFilter(est_elements=1000000, false_positive_rate=0.05)
>>> with open('war_and_peace.txt', 'r') as fp:
>>>     for line in fp:
>>>         for word in line.split():
>>>             blm.add(word.lower())  # add each word to the bloom filter!
>>> # end reading in the file

Query the Bloom Filter

>>> words_to_check = ['step', 'borzoi', 'diametrically', 'fleches', 'rain']
>>> for word in words_to_check:
>>>     print(blm.check(word))  # prints True if the word was added (or is a false positive)

Export the Bloom Filter

>>> blm.export('war_and_peace_bloom.blm')

Import a Bloom Filter

>>> blm2 = BloomFilter(filepath='war_and_peace_bloom.blm')
>>> print(blm2.check('sutler'))

Other Bloom Filters

Bloom Filter on Disk

The Bloom Filter on Disk is a specialized version of the standard Bloom Filter that operates directly off of disk instead of in memory. This can be useful for very large Bloom Filters or when many Bloom Filters that have been exported to file need to be accessed.
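
A minimal sketch of creating and querying an on-disk filter, assuming the BloomFilterOnDisk class takes the output filepath plus the standard Bloom Filter parameters and provides a close() method:

>>> from probables import (BloomFilterOnDisk)
>>> blmd = BloomFilterOnDisk('on_disk.blm', est_elements=1000000, false_positive_rate=0.05)
>>> blmd.add('borzoi')      # writes go to the file-backed bit array
>>> blmd.check('borzoi')    # should return True
>>> blmd.close()            # flush and release the file handle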

Expanding Bloom Filter

The Expanding Bloom Filter is a specialized version of the standard Bloom Filter that automatically grows to ensure that the desired false positive rate is not exceeded. This is ideal for situations where the number of elements that will be added is difficult to estimate in advance.
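
A minimal sketch, assuming the ExpandingBloomFilter class uses the standard constructor parameters and exposes an expansions property:

>>> from probables import (ExpandingBloomFilter)
>>> ebf = ExpandingBloomFilter(est_elements=1000, false_positive_rate=0.05)
>>> for i in range(5000):   # insert far more than the 1,000 estimated elements
>>>     ebf.add(str(i))
>>> ebf.check('100')        # should return True
>>> ebf.expansions          # number of times the filter has grown (assumed property name)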

Rotating Bloom Filter

The Rotating Bloom Filter is a specialized version of the standard Bloom Filter that rolls earlier entries out of the filter as they become stale. The popping of the queue can be done either programmatically or automatically.
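
A minimal sketch, assuming a max_queue_size parameter and a push() method for programmatic rotation (both assumptions here):

>>> from probables import (RotatingBloomFilter)
>>> rbf = RotatingBloomFilter(est_elements=1000, false_positive_rate=0.05, max_queue_size=3)
>>> rbf.add('older-entry')
>>> rbf.push()              # programmatically roll the oldest entries off of the queue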

Counting Bloom Filter

Counting Bloom Filters are another specialized version of the standard Bloom Filter. Instead of using a bit array to track added elements, a Counting Bloom uses integers to track the number of times an element has been added.
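
A minimal sketch, assuming the CountingBloomFilter class uses the standard constructor parameters and that check() returns the approximate insertion count:

>>> from probables import (CountingBloomFilter)
>>> cbf = CountingBloomFilter(est_elements=1000, false_positive_rate=0.05)
>>> cbf.add('borzoi')
>>> cbf.add('borzoi')
>>> cbf.check('borzoi')     # should return 2 (an approximate count)
>>> cbf.remove('borzoi')    # decrement the count for the element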

Count-Min Sketch

Count-Min Sketches, and their derivatives, are good for counting the number of occurrences of an element in streaming data without needing to retain all of the data elements. The result is a probabilistic count of the number of times each element was inserted into the data structure. The estimate is an upper bound: the result may be more than the true number of insertions, but never fewer. more information

Import, Initialize, and Train

>>> from probables import (CountMinSketch)
>>> cms = CountMinSketch(width=100000, depth=5)
>>> with open('war_and_peace.txt', 'r') as fp:
>>>     for line in fp:
>>>         for word in line.split():
>>>             cms.add(word.lower())  # add each to the count-min sketch!

Query the Count-Min Sketch

>>> words_to_check = ['step', 'borzoi', 'diametrically', 'fleches', 'rain']
>>> for word in words_to_check:
>>>     print(cms.check(word))  # prints: 80, 17, 1, 20, 25

Export Count-Min Sketch

>>> cms.export('war_and_peace.cms')

Import a Count-Min Sketch

>>> cms2 = CountMinSketch(filepath='war_and_peace.cms')
>>> print(cms2.check('fleches'))  # prints 20

Other Count-Min Sketches

Count-Mean Sketch and Count-Mean-Min Sketch

The Count-Mean Sketch and Count-Mean-Min Sketch use the same underlying data structure as the Count-Min Sketch but differ in how they calculate the number of times an element has been inserted. They are currently supported either by specifying the desired method at query time or by initializing the desired class: CountMeanSketch or CountMeanMinSketch.
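
A minimal sketch, assuming CountMeanMinSketch shares the Count-Min Sketch constructor shown above:

>>> from probables import (CountMeanMinSketch)
>>> cmms = CountMeanMinSketch(width=100000, depth=5)
>>> for _ in range(20):
>>>     cmms.add('fleches')
>>> cmms.check('fleches')   # count-mean-min estimate; should be close to 20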

Heavy Hitters

Heavy Hitters is a version of the Count-Min Sketch that tracks the elements that are seen most often. Beyond the normal initialization parameters, one only needs to specify the number of heavy hitters to track.
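
A minimal sketch, assuming the HeavyHitters class accepts a num_hitters parameter and exposes a heavy_hitters property (both assumptions here):

>>> from probables import (HeavyHitters)
>>> hh = HeavyHitters(num_hitters=100, width=100000, depth=5)
>>> with open('war_and_peace.txt', 'r') as fp:
>>>     for line in fp:
>>>         for word in line.split():
>>>             hh.add(word.lower())
>>> hh.heavy_hitters        # mapping of the most frequently seen words to their counts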

Stream Threshold

Stream Threshold is another version of the Count-Min Sketch similar to the Heavy Hitters. The main difference is that there is a threshold for including an element to be tracked instead of tracking a set number of elements.
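
A minimal sketch, assuming the StreamThreshold class accepts a threshold parameter and exposes a meets_threshold property (both assumptions here):

>>> from probables import (StreamThreshold)
>>> st = StreamThreshold(threshold=100, width=100000, depth=5)
>>> with open('war_and_peace.txt', 'r') as fp:
>>>     for line in fp:
>>>         for word in line.split():
>>>             st.add(word.lower())
>>> st.meets_threshold      # mapping of words seen at least 100 times to their counts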

Cuckoo Filters

Cuckoo Filters are a memory efficient method to approximate set membership. They allow elements to be added to, removed from, and looked up in the set. They get the name cuckoo filter from their use of the cuckoo hashing strategy.

Import, Initialize, and Train

>>> from probables import (CuckooFilter)
>>> ccf = CuckooFilter(capacity=100000, bucket_size=4, max_swaps=100)
>>> with open('war_and_peace.txt', 'r') as fp:
>>>     for line in fp:
>>>         for word in line.split():
>>>             ccf.add(word.lower())  # add each to the cuckoo filter!

Query the Cuckoo Filter

>>> words_to_check = ['borzoi', 'diametrically', 'fleches', 'rain', 'foo']
>>> for word in words_to_check:
>>>     print(ccf.check(word))  # prints: True, True, True, True, False

Export the Cuckoo Filter

>>> ccf.export('war_and_peace.cko')

Import a Cuckoo Filter

>>> ccf2 = CuckooFilter(filepath='war_and_peace.cko')
>>> print(ccf2.check('fleches'))  # prints True

Cuckoo Filters based on Error Rate

To initialize a Cuckoo Filter based on a desired error rate, two class methods are provided: init_error_rate() creates a new Cuckoo Filter, and load_error_rate() loads a previously exported Cuckoo Filter whose parameters were determined by an error rate.

>>> cko = CuckooFilter.init_error_rate(0.00001)
>>> cko.export('war_and_peace.cko')
>>> ckf = CuckooFilter.load_error_rate('war_and_peace.cko')

Other Cuckoo Filters

Counting Cuckoo Filter

The counting cuckoo filter is similar to the standard filter except that it tracks the number of times a fingerprint has been added to the filter.
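
A minimal sketch, assuming the CountingCuckooFilter class mirrors the CuckooFilter constructor and that check() returns the number of times a fingerprint was added:

>>> from probables import (CountingCuckooFilter)
>>> cck = CountingCuckooFilter(capacity=100000, bucket_size=4, max_swaps=100)
>>> cck.add('borzoi')
>>> cck.add('borzoi')
>>> cck.check('borzoi')     # should return 2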

Quotient Filters

Quotient Filters provide set membership operations over large datasets while keeping a relatively small memory footprint. They provide a zero percent false negative rate and a small false positive rate. more information

Import, Initialize, and Train

>>> from probables import (QuotientFilter)
>>> qf = QuotientFilter(quotient=22)
>>> with open('war_and_peace.txt', 'r') as fp:
>>>     for line in fp:
>>>         for word in line.split():
>>>             qf.add(word.lower())  # add each word to the quotient filter!

Query the Quotient Filter

>>> words_to_check = ['borzoi', 'diametrically', 'fleches', 'rain', 'foo']
>>> for word in words_to_check:
>>>     print(qf.check(word))  # prints: True, True, True, True, False

Custom Hashing Functions

In many instances, to get the best raw performance out of the data structures, it is wise to use a hashing algorithm that is not written in pure Python. It is recommended to use a compiled implementation such as mmh3 or pyhash, or even the built-in cryptographic hashes.

Some pre-defined hashing strategies are provided that use built in cryptographic hashes.

To use a pre-defined alternative hashing strategy:

>>> from probables import (BloomFilter)
>>> from probables.hashes import (default_sha256, default_md5)
>>> blm = BloomFilter(est_elements=1000, false_positive_rate=0.05,
                      hash_function=default_sha256)
>>> blm.add('google.com')
>>> blm.check('facebook.com')  # should return False
>>> blm.check('google.com')  # should return True

Decorators are provided to help make generating hashing strategies easier.

Defining a hashing function using the provided decorators:

>>> import mmh3  # murmur hash 3 implementation (pip install mmh3)
>>> from probables.hashes import (hash_with_depth_bytes)
>>> from probables import (BloomFilter)
>>>
>>> @hash_with_depth_bytes
>>> def my_hash(key):
>>>     return mmh3.hash_bytes(key)
>>>
>>> blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

Or, using the integer-based decorator:

>>> import hashlib
>>> from probables.hashes import (hash_with_depth_int)
>>> from probables import (BloomFilter)
>>>
>>> UINT64_T_MAX = 2 ** 64 - 1  # largest unsigned 64-bit integer
>>>
>>> @hash_with_depth_int
>>> def my_hash(key, encoding='utf-8'):
>>>     max64mod = UINT64_T_MAX + 1
>>>     val = int(hashlib.sha512(key.encode(encoding)).hexdigest(), 16)
>>>     return val % max64mod
>>>
>>> blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

Generating a completely different hashing strategy without the decorators:

>>> import hashlib
>>>
>>> UINT64_T_MAX = 2 ** 64 - 1  # largest unsigned 64-bit integer
>>>
>>> def my_hash(key, depth, encoding='utf-8'):
>>>     max64mod = UINT64_T_MAX + 1
>>>     results = list()
>>>     for i in range(0, depth):
>>>         tmp = key[i:] + key[:i]  # rotate the key so each depth produces a different hash
>>>         val = int(hashlib.sha512(tmp.encode(encoding)).hexdigest(), 16)
>>>         results.append(val % max64mod)
>>>     return results

Indices and Tables