PyProbables
pyprobables is a pure-python library for probabilistic data structures. The goal is to provide the developer with a pure-python implementation of common probabilistic data-structures to use in their work.
To achieve better raw performance, it is recommended supplying an alternative hashing algorithm that has been compiled in C. This could include using the md5 and sha512 algorithms provided or installing a third party package and writing your own hashing strategy. Some options include the murmur hash mmh3 or those from the pyhash library. Each data object in pyprobables makes it easy to pass in a custom hashing function.
Read more about how to use Supplying a pre-defined, alternative hashing strategies or Defining hashing function using the provided decorators.
Installation
Pip Installation:
$ pip install pyprobables
To install from source:
To install pyprobables, simply clone the repository on GitHub, then run from the folder:
$ python setup.py install
pyprobables supports python 3.6 - 3.11+
For python 2.7 support, install release 0.3.2
$ pip install pyprobables==0.3.2
API Documentation
The documentation of is hosted on readthedocs.io
You can build the documentation locally by running:
$ pip install sphinx
$ cd docs/
$ make html
Automated Tests
To run automated tests, one must simply run the following command from the downloaded folder:
$ python setup.py test
Quickstart
Import pyprobables and setup a Bloom Filter
from probables import BloomFilter
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05)
blm.add('google.com')
blm.check('facebook.com') # should return False
blm.check('google.com') # should return True
Import pyprobables and setup a Count-Min Sketch
from probables import CountMinSketch
cms = CountMinSketch(width=1000, depth=5)
cms.add('google.com') # should return 1
cms.add('facebook.com', 25) # insert 25 at once; should return 25
Import pyprobables and setup a Cuckoo Filter
from probables import CuckooFilter
cko = CuckooFilter(capacity=100, max_swaps=10)
cko.add('google.com')
cko.check('facebook.com') # should return False
cko.check('google.com') # should return True
Import pyprobables and setup a Quotient Filter
from probables import QuotientFilter
qf = QuotientFilter(quotient=24)
qf.add('google.com')
qf.check('facebook.com') # should return False
qf.check('google.com') # should return True
Supplying a pre-defined, alternative hashing strategies
from probables import BloomFilter
from probables.hashes import default_sha256
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05,
hash_function=default_sha256)
blm.add('google.com')
blm.check('facebook.com') # should return False
blm.check('google.com') # should return True
Defining hashing function using the provided decorators
import mmh3 # murmur hash 3 implementation (pip install mmh3)
from probables.hashes import hash_with_depth_bytes
from probables import BloomFilter
@hash_with_depth_bytes
def my_hash(key, depth):
return mmh3.hash_bytes(key, seed=depth)
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)
import hashlib
from probables.hashes import hash_with_depth_int
from probables.constants import UINT64_T_MAX
from probables import BloomFilter
@hash_with_depth_int
def my_hash(key, seed=0, encoding="utf-8"):
max64mod = UINT64_T_MAX + 1
val = int(hashlib.sha512(key.encode(encoding)).hexdigest(), 16)
val += seed # not a good example, but uses the seed value
return val % max64mod
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)
See the API documentation for other data structures available and the quickstart page for more examples!
Changelog
Please see the changelog for a list of all changes.
Backward Compatible Changes
If you are using previously exported probablistic data structures (v0.4.1 or below) and used the default hashing strategy, you will want to use the following code to mimic the original default hashing algorithm.
from probables import BloomFilter
from probables.hashes import hash_with_depth_int
@hash_with_depth_int
def old_fnv1a(key, depth=1):
return tmp_fnv_1a(key)
def tmp_fnv_1a(key):
max64mod = UINT64_T_MAX + 1
hval = 14695981039346656073
fnv_64_prime = 1099511628211
tmp = map(ord, key)
for t_str in tmp:
hval ^= t_str
hval *= fnv_64_prime
hval %= max64mod
return hval
blm = BloomFilter(filpath="old-file-path.blm", hash_function=old_fnv1a)
- pyprobables API
- Data Structures and Classes
- Bloom Filters
- BloomFilter
BloomFilter
BloomFilter.add()
BloomFilter.add_alt()
BloomFilter.bloom
BloomFilter.bloom_length
BloomFilter.check()
BloomFilter.check_alt()
BloomFilter.clear()
BloomFilter.current_false_positive_rate()
BloomFilter.elements_added
BloomFilter.estimate_elements()
BloomFilter.estimated_elements
BloomFilter.export()
BloomFilter.export_c_header()
BloomFilter.export_hex()
BloomFilter.export_size()
BloomFilter.false_positive_rate
BloomFilter.frombytes()
BloomFilter.hash_function
BloomFilter.hashes()
BloomFilter.intersection()
BloomFilter.is_on_disk
BloomFilter.jaccard_index()
BloomFilter.number_bits
BloomFilter.number_hashes
BloomFilter.union()
- BloomFilterOnDisk
- ExpandingBloomFilter
ExpandingBloomFilter
ExpandingBloomFilter.add()
ExpandingBloomFilter.add_alt()
ExpandingBloomFilter.check()
ExpandingBloomFilter.check_alt()
ExpandingBloomFilter.elements_added
ExpandingBloomFilter.estimated_elements
ExpandingBloomFilter.expansions
ExpandingBloomFilter.export()
ExpandingBloomFilter.false_positive_rate
ExpandingBloomFilter.frombytes()
ExpandingBloomFilter.hash_function
ExpandingBloomFilter.push()
- RotatingBloomFilter
RotatingBloomFilter
RotatingBloomFilter.add()
RotatingBloomFilter.add_alt()
RotatingBloomFilter.check()
RotatingBloomFilter.check_alt()
RotatingBloomFilter.current_queue_size
RotatingBloomFilter.elements_added
RotatingBloomFilter.estimated_elements
RotatingBloomFilter.expansions
RotatingBloomFilter.export()
RotatingBloomFilter.false_positive_rate
RotatingBloomFilter.frombytes()
RotatingBloomFilter.hash_function
RotatingBloomFilter.max_queue_size
RotatingBloomFilter.pop()
RotatingBloomFilter.push()
- CountingBloomFilter
CountingBloomFilter
CountingBloomFilter.add()
CountingBloomFilter.add_alt()
CountingBloomFilter.bloom
CountingBloomFilter.bloom_length
CountingBloomFilter.check()
CountingBloomFilter.check_alt()
CountingBloomFilter.clear()
CountingBloomFilter.current_false_positive_rate()
CountingBloomFilter.elements_added
CountingBloomFilter.estimate_elements()
CountingBloomFilter.estimated_elements
CountingBloomFilter.export()
CountingBloomFilter.export_c_header()
CountingBloomFilter.export_hex()
CountingBloomFilter.export_size()
CountingBloomFilter.false_positive_rate
CountingBloomFilter.frombytes()
CountingBloomFilter.hash_function
CountingBloomFilter.hashes()
CountingBloomFilter.intersection()
CountingBloomFilter.is_on_disk
CountingBloomFilter.jaccard_index()
CountingBloomFilter.number_bits
CountingBloomFilter.number_hashes
CountingBloomFilter.remove()
CountingBloomFilter.remove_alt()
CountingBloomFilter.union()
- BloomFilter
- Cuckoo Filters
- CuckooFilter
CuckooFilter
CuckooFilter.add()
CuckooFilter.auto_expand
CuckooFilter.bucket_size
CuckooFilter.buckets
CuckooFilter.capacity
CuckooFilter.check()
CuckooFilter.elements_added
CuckooFilter.error_rate
CuckooFilter.expand()
CuckooFilter.expansion_rate
CuckooFilter.export()
CuckooFilter.fingerprint_size
CuckooFilter.fingerprint_size_bits
CuckooFilter.frombytes()
CuckooFilter.init_error_rate()
CuckooFilter.load_error_rate()
CuckooFilter.load_factor()
CuckooFilter.max_swaps
CuckooFilter.remove()
- CountingCuckooFilter
CountingCuckooFilter
CountingCuckooFilter.add()
CountingCuckooFilter.auto_expand
CountingCuckooFilter.bucket_size
CountingCuckooFilter.buckets
CountingCuckooFilter.capacity
CountingCuckooFilter.check()
CountingCuckooFilter.elements_added
CountingCuckooFilter.error_rate
CountingCuckooFilter.expand()
CountingCuckooFilter.expansion_rate
CountingCuckooFilter.export()
CountingCuckooFilter.fingerprint_size
CountingCuckooFilter.fingerprint_size_bits
CountingCuckooFilter.frombytes()
CountingCuckooFilter.init_error_rate()
CountingCuckooFilter.load_error_rate()
CountingCuckooFilter.load_factor()
CountingCuckooFilter.max_swaps
CountingCuckooFilter.remove()
CountingCuckooFilter.unique_elements
- CuckooFilter
- Count-Min Sketches
- CountMinSketch
CountMinSketch
CountMinSketch.add()
CountMinSketch.add_alt()
CountMinSketch.check()
CountMinSketch.check_alt()
CountMinSketch.clear()
CountMinSketch.confidence
CountMinSketch.depth
CountMinSketch.elements_added
CountMinSketch.error_rate
CountMinSketch.export()
CountMinSketch.frombytes()
CountMinSketch.hashes()
CountMinSketch.join()
CountMinSketch.query_type
CountMinSketch.remove()
CountMinSketch.remove_alt()
CountMinSketch.width
- CountMeanSketch
- CountMeanMinSketch
- HeavyHitters
- StreamThreshold
- CountMinSketch
- QuotientFilter
- Utilities
- Bloom Filters
- Exceptions
- Hashing Functions
- Indices and Tables
- Quickstart
- Indices and Tables