pyprobables API
Here you can find the full developer API for the pyprobables project. pyprobables provides a suite of probabilistic data-structures to be used in data analytics and data science projects.
Data Structures and Classes
Bloom Filters
Bloom Filters are a class of probabilistic data structures used for set operations. Bloom Filters guarantee a zero percent false negative rate and a predetermined false positive rate. Once the number of elements inserted exceeds the estimated elements, the false positive rate will increase over the desired amount.
BloomFilter
- class probables.BloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]
Simple Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)
- Parameters:
est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
filepath (str) – Path to file to load
hex_string (str) – Hex based representation to be loaded
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
Note
- Initialization order of operations:
From file
From Hex String
From params
- add(key)[source]
Add the key to the Bloom Filter
- Parameters:
key (str) – The element to be inserted
- Return type:
None
- add_alt(hashes)[source]
Add the element represented by hashes into the Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to insert
- Return type:
None
- property bloom: array
The bit/int array
- Type:
list(int)
- property bloom_length: int
Length of the Bloom Filter array
Note
Not settable
- Type:
int
- check(key)[source]
Check if the key is likely in the Bloom Filter
- Parameters:
key (str) – The element to be checked
- Returns:
True if likely encountered, False if definately not
- Return type:
bool
- check_alt(hashes)[source]
Check if the element represented by hashes is in the Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to check
- Returns:
True if likely encountered, False if definately not
- Return type:
bool
- current_false_positive_rate()[source]
Calculate the current false positive rate based on elements added
- Returns:
The current false positive rate
- Return type:
float
- property elements_added: int
Number of elements added to the Bloom Filter
Note
Changing this can cause the current false positive rate to be reported incorrectly
- Type:
int
- estimate_elements()[source]
Estimate the number of unique elements added
- Returns:
Number of elements estimated to be inserted
- Return type:
int
Note
Returns -1 if all bits in the Bloom filter are set
- property estimated_elements: int
The maximum number of elements estimated to be added at setup
Note
Not settable
- Type:
int
- export(file)[source]
Export the Bloom Filter to disk
- Parameters:
file (str) – The file or filepath to which the Bloom Filter will be written.
- Return type:
None
- export_c_header(filename)[source]
Export the Bloom Filter to disk as a C header file.
- Parameters:
filename (str) – The filename to which the Bloom Filter will be written.
- Return type:
None
- export_hex()[source]
Export the Bloom Filter as a hex string
- Returns:
Hex representation of the Bloom Filter
- Return type:
str
- export_size()[source]
Calculate the size of the bloom on disk
- Returns:
Size of the Bloom Filter when exported to disk
- Return type:
int
- property false_positive_rate: float
The maximum desired false positive rate
Note
Not settable
- Type:
float
- classmethod frombytes(b, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Bloom Filter
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- property hash_function: Callable[[str | bytes, int], List[int]]
The hash function used
Note
Not settable
- Type:
function
- hashes(key, depth=None)[source]
Return the hashes based on the provided key
- Parameters:
key (str) – Description of arg1
depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
- Returns:
A list of the hashes for the key in int form
- Return type:
List(int)
- intersection(second)[source]
Return a new Bloom Filter that contains the intersection of the two
- Parameters:
second (BloomFilter) – The Bloom Filter with which to take the intersection
- Returns:
The new Bloom Filter containing the intersection
- Return type:
- Raises:
TypeError – When second is not either a
BloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilterOnDisk object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
- property is_on_disk: bool
Is the Bloom Filter on Disk or not
Note
Not settable
- Type:
bool
- jaccard_index(second)[source]
Calculate the jaccard similarity score between two Bloom Filters
- Parameters:
second (BloomFilter) – The Bloom Filter to compare with
- Returns:
A numeric value between 0 and 1 where 1 is identical and 0 means completely different
- Return type:
float
- Raises:
TypeError – When second is not either a
BloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilterOnDisk object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
- property number_bits: int
Number of bits in the Bloom Filter
Note
Not settable
- Type:
int
- property number_hashes: int
The number of hashes required for the Bloom Filter hashing strategy
Note
Not settable
- Type:
int
- union(second)[source]
Return a new Bloom Filter that contains the union of the two
- Parameters:
second (BloomFilter) – The Bloom Filter with which to calculate the union
- Returns:
The new Bloom Filter containing the union
- Return type:
- Raises:
TypeError – When second is not either a
BloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilterOnDisk object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
BloomFilterOnDisk
- class probables.BloomFilterOnDisk(filepath, est_elements=None, false_positive_rate=None, hex_string=None, hash_function=None)[source]
Simple Bloom Filter implementation directly on disk for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)
- Parameters:
filepath (str) – Path to file to load
est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
hex_string (str) – Hex based representation to be loaded
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- Raises:
NotSupportedError – Loading using a hex string is not supported
Note
- Initialization order of operations:
Esimated elements and false positive rate
From Hex String
Only filepath provided
- add_alt(hashes)[source]
Add the element represented by hashes into the Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to insert
- Return type:
None
- export(file)[source]
Export to disk if a different location
- Parameters:
file (str|Path) – The filename to which the Bloom Filter will be exported
- Return type:
None
Note
Only exported if the filename is not the original filename
For more information of all methods and properties, see BloomFilter.
ExpandingBloomFilter
- class probables.ExpandingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hash_function=None)[source]
Simple expanding Bloom Filter implementation for use in python; the Bloom Fiter will automatically expand, or grow, if the false positive rate is about to become greater than the desired false positive rate.
- Parameters:
est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
An expanding Bloom Filter object
- Return type:
Note
- Initialization order of operations:
Filepath
est_elements and false_positive_rate
- add(key, force=False)[source]
Add the key to the Bloom Filter
- Parameters:
key (str) – The element to be inserted
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
- Return type:
None
- add_alt(hashes, force=False)[source]
Add the element represented by hashes into the Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to insert
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
- Return type:
None
- check(key)[source]
Check to see if the key is in the Bloom Filter
- Parameters:
key (str) – The key to check for in the Bloom Filter
- Returns:
True if the element is likely present; False if definately not present
- Return type:
bool
- check_alt(hashes)[source]
Check to see if the hashes are in the Bloom Filter
- Parameters:
hashes (list) – The hash representation to check for in the Bloom Filter
- Returns:
True if the element is likely present; False if definately not present
- Return type:
bool
- property elements_added: int
The total number of elements added
- Type:
int
- property estimated_elements: int
The original number of elements estimated to be in the Bloom Filter
- Type:
int
- property expansions: int
The number of expansions
- Type:
int
- export(file)[source]
Export an expanding Bloom Filter, or subclass, to disk
- Parameters:
filepath (str) – The path to the file to import
file (Path | str | IOBase | mmap) –
- Return type:
None
- property false_positive_rate: float
The desired false positive rate of the expanding Bloom Filter
- Type:
float
- classmethod frombytes(b, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Expanding Bloom Filter
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- property hash_function: Callable[[str | bytes, int], List[int]]
The total number of elements added
- Type:
int
RotatingBloomFilter
- class probables.RotatingBloomFilter(est_elements=None, false_positive_rate=None, max_queue_size=10, filepath=None, hash_function=None)[source]
Simple Rotating Bloom Filter implementation that allows for the “older” elements added to be removed, in chunks. As the queue fills up, those elements inserted earlier will be bulk removed. This also provides the user with the oportunity to force the removal instead of it being time based.
- Parameters:
est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)
Note
- Initialization order of operations:
Filepath
est_elements and false_positive_rate
- add(key, force=False)
Add the key to the Bloom Filter
- Parameters:
key (str) – The element to be inserted
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
- Return type:
None
- add_alt(hashes, force=False)[source]
Add the element represented by hashes into the Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to insert
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
- Return type:
None
- check(key)
Check to see if the key is in the Bloom Filter
- Parameters:
key (str) – The key to check for in the Bloom Filter
- Returns:
True if the element is likely present; False if definately not present
- Return type:
bool
- check_alt(hashes)
Check to see if the hashes are in the Bloom Filter
- Parameters:
hashes (list) – The hash representation to check for in the Bloom Filter
- Returns:
True if the element is likely present; False if definately not present
- Return type:
bool
- property current_queue_size: int
The current size of the queue
- Type:
int
- property elements_added: int
The total number of elements added
- Type:
int
- property estimated_elements: int
The original number of elements estimated to be in the Bloom Filter
- Type:
int
- property expansions: int
The number of expansions
- Type:
int
- export(file)
Export an expanding Bloom Filter, or subclass, to disk
- Parameters:
filepath (str) – The path to the file to import
file (Path | str | IOBase | mmap) –
- Return type:
None
- property false_positive_rate: float
The desired false positive rate of the expanding Bloom Filter
- Type:
float
- classmethod frombytes(b, max_queue_size, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Expanding Bloom Filter
max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- property hash_function: Callable[[str | bytes, int], List[int]]
The total number of elements added
- Type:
int
- property max_queue_size: int
The maximum size for the queue
- Type:
int
- pop()[source]
Pop the oldest Bloom Filter off of the queue without pushing a new Bloom Filter onto the queue
- Raises:
RotatingBloomFilterError – Unable to rotate the Bloom Filter
- Return type:
None
CountingBloomFilter
- class probables.CountingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]
Simple Counting Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/counting_bloom)
- Parameters:
est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
filepath (str) – Path to file to load
hex_string (str) – Hex based representation to be loaded
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Counting Bloom Filter object
- Return type:
Note
- Initialization order of operations:
From file
From Hex String
From params
- add(key, num_els=1)[source]
Add the key to the Counting Bloom Filter
- Parameters:
key (str) – The element to be inserted
num_els (int) – Number of times to insert the element
- Returns:
Maximum number of insertions
- Return type:
int
- add_alt(hashes, num_els=1)[source]
Add the element represented by hashes into the Counting Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to insert
num_els (int) – Number of times to insert the element
- Returns:
Maximum number of insertions
- Return type:
int
- property bloom: array
The bit/int array
- Type:
list(int)
- property bloom_length: int
Length of the Bloom Filter array
Note
Not settable
- Type:
int
- check(key)[source]
Check if the key is likely in the Counting Bloom Filter
- Parameters:
key (str) – The element to be checked
- Returns:
Maximum number of insertions
- Return type:
int
- check_alt(hashes)[source]
Check if the element represented by hashes is in the Counting Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to check
- Returns:
Maximum number of insertions
- Return type:
int
- clear()
Clear or reset the Counting Bloom Filter
- Return type:
None
- current_false_positive_rate()
Calculate the current false positive rate based on elements added
- Returns:
The current false positive rate
- Return type:
float
- property elements_added: int
Number of elements added to the Bloom Filter
Note
Changing this can cause the current false positive rate to be reported incorrectly
- Type:
int
- estimate_elements()
Estimate the number of unique elements added
- Returns:
Number of elements estimated to be inserted
- Return type:
int
Note
Returns -1 if all bits in the Bloom filter are set
- property estimated_elements: int
The maximum number of elements estimated to be added at setup
Note
Not settable
- Type:
int
- export(file)
Export the Bloom Filter to disk
- Parameters:
file (str) – The file or filepath to which the Bloom Filter will be written.
- Return type:
None
- export_c_header(filename)
Export the Bloom Filter to disk as a C header file.
- Parameters:
filename (str) – The filename to which the Bloom Filter will be written.
- Return type:
None
- export_hex()
Export the Bloom Filter as a hex string
- Returns:
Hex representation of the Bloom Filter
- Return type:
str
- export_size()
Calculate the size of the bloom on disk
- Returns:
Size of the Bloom Filter when exported to disk
- Return type:
int
- property false_positive_rate: float
The maximum desired false positive rate
Note
Not settable
- Type:
float
- classmethod frombytes(b, hash_function=None)[source]
- Parameters:
b (ByteString) – the bytes to load as a Counting Bloom Filter
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Counting Bloom Filter object
- Return type:
- property hash_function: Callable[[str | bytes, int], List[int]]
The hash function used
Note
Not settable
- Type:
function
- hashes(key, depth=None)
Return the hashes based on the provided key
- Parameters:
key (str) – Description of arg1
depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
- Returns:
A list of the hashes for the key in int form
- Return type:
List(int)
- intersection(second)[source]
Take the intersection of two Counting Bloom Filters
- Parameters:
second (CountingBloomFilter) – The Bloom Filter with which to take the intersection
- Returns:
The new Counting Bloom Filter containing the union
- Return type:
- Raises:
TypeError – When second is not a
CountingBloomFilter
Note
The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
- property is_on_disk: bool
Is the Bloom Filter on Disk or not
Note
Not settable
- Type:
bool
- jaccard_index(second)[source]
Take the Jaccard Index of two Counting Bloom Filters
- Parameters:
second (CountingBloomFilter) – The Bloom Filter with which to take the jaccard index
- Returns:
A numeric value between 0 and 1 where 1 is identical and 0 means completely different
- Return type:
float
- Raises:
TypeError – When second is not a
CountingBloomFilter
Note
The Jaccard Index is based on the unique set of elements added and not the number of each element added
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
- property number_bits: int
Number of bits in the Bloom Filter
Note
Not settable
- Type:
int
- property number_hashes: int
The number of hashes required for the Bloom Filter hashing strategy
Note
Not settable
- Type:
int
- remove(key, num_els=1)[source]
Remove the element from the counting bloom
- Parameters:
key (str) – The element to be removed
num_els (int) – Number of times to remove the element
- Returns:
Maximum number of insertions after the removal
- Return type:
int
- remove_alt(hashes, num_els=1)[source]
Remvoe the element represented by hashes from the Counting Bloom Filter
- Parameters:
hashes (list) – A list of integers representing the key to remove
num_els (int) – Number of times to remove the element
- Returns:
Maximum number of insertions after the removal
- Return type:
int
- union(second)[source]
Return a new Countiong Bloom Filter that contains the union of the two
- Parameters:
second (CountingBloomFilter) – The Counting Bloom Filter with which to calculate the union
- Returns:
The new Counting Bloom Filter containing the union
- Return type:
- Raises:
TypeError – When second is not a
CountingBloomFilter
Note
The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
Cuckoo Filters
Cuckoo filters are a space efficient data structure that supports set membership testing. Cuckoo filters support insertion, deletion, and lookup of elements with low overhead and few false positive results. The name is derived from the cuckoo hashing strategy used to resolve conflicts.
CuckooFilter
- class probables.CuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]
Simple Cuckoo Filter implementation
- Parameters:
capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
filepath (str) – The path to the file to load or None if no file
hash_function (function) – Hashing strategy function to use hf(key)
- Returns:
A Cuckoo Filter object
- Return type:
- add(key)[source]
Add element key to the filter
- Parameters:
key (str) – The element to add
- Raises:
CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’
- property auto_expand: bool
True if the cuckoo filter will expand automatically
- Type:
bool
- property bucket_size: int
The number of buckets per bin
Note
Not settable
- Type:
int
- property buckets: List[List[int]]
The buckets holding the fingerprints
Note
Not settable
- Type:
list(list)
- property capacity: int
The number of bins
Note
Not settable
- Type:
int
- check(key)[source]
Check if an element is in the filter
- Parameters:
key (str) – Element to check
- Returns:
True if likely present, False if definately not
- Return type:
bool
- property elements_added: int
The number of elements added
Note
Not settable
- Type:
int
- property error_rate: float
The error rate of the cuckoo filter
- Type:
float
- property expansion_rate: int
The rate at expansion when the filter grows
- Type:
int
- export(file)[source]
Export cuckoo filter to file
- Parameters:
file (Path | str | IOBase | mmap) – Path to file to export
- Return type:
None
- property fingerprint_size: int
The size in bytes of the fingerprint
- Raises:
ValueError – If the size is not between 1 and 4
Note
The size of the fingerprint must be between 1 and 4
- Type:
int
- property fingerprint_size_bits: int
The size in bits of the fingerprint
- Type:
int
- classmethod frombytes(b, error_rate=None, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Expanding Bloom Filter
error_rate (float) – The error rate of the cuckoo filter, if used to generate the original filter
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- classmethod init_error_rate(error_rate, capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, hash_function=None)[source]
Initialize a simple Cuckoo Filter based on error rate
- Parameters:
error_rate (float) –
capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
hash_function (function) – Hashing strategy function to use hf(key)
- Returns:
A Cuckoo Filter object
- Return type:
- classmethod load_error_rate(error_rate, filepath, hash_function=None)[source]
Initialize a previously exported Cuckoo Filter based on error rate
- Parameters:
error_rate (float) –
filepath (str) – The path to the file to load or None if no file
hash_function (function) – Hashing strategy function to use hf(key)
- Returns:
A Cuckoo Filter object
- Return type:
- property max_swaps: int
The maximum number of swaps to perform
Note
Not settable
- Type:
int
CountingCuckooFilter
- class probables.CountingCuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]
Simple Counting Cuckoo Filter implementation
- Parameters:
capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
filename (str) – The path to the file to load or None if no file
filepath (str | Path | None) –
hash_function (Callable[[str | bytes, int], int] | None) –
- Returns:
A Cuckoo Filter object
- Return type:
- add(key)[source]
Add element key to the filter
- Parameters:
key (str) – The element to add
- Raises:
CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’
- Return type:
None
- property auto_expand: bool
True if the cuckoo filter will expand automatically
- Type:
bool
- property bucket_size: int
The number of buckets per bin
Note
Not settable
- Type:
int
- property buckets: List[List[CountingCuckooBin]]
The buckets holding the fingerprints
Note
Not settable
- Type:
list(list)
- property capacity: int
The number of bins
Note
Not settable
- Type:
int
- check(key)[source]
Check if an element is in the filter
- Parameters:
key (str) – Element to check
- Returns:
The number of times inserted into the filter
- Return type:
int
- property elements_added: int
The number of elements added
Note
Not settable
- Type:
int
- property error_rate: float
The error rate of the cuckoo filter
- Type:
float
- property expansion_rate: int
The rate at expansion when the filter grows
- Type:
int
- export(file)[source]
Export cuckoo filter to file
- Parameters:
file (str) – Path to file to export
- Return type:
None
- property fingerprint_size: int
The size in bytes of the fingerprint
- Raises:
ValueError – If the size is not between 1 and 4
Note
The size of the fingerprint must be between 1 and 4
- Type:
int
- property fingerprint_size_bits: int
The size in bits of the fingerprint
- Type:
int
- classmethod frombytes(b, error_rate=None, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Expanding Bloom Filter
error_rate (float) – The error rate of the cuckoo filter, if used to generate the original filter
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- classmethod init_error_rate(error_rate, capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, hash_function=None)[source]
Initialize a simple Cuckoo Filter based on error rate
- Parameters:
error_rate (float) –
capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
hash_function (function) – Hashing strategy function to use hf(key)
- Returns:
A Cuckoo Filter object
- Return type:
- classmethod load_error_rate(error_rate, filepath, hash_function=None)[source]
Initialize a previously exported Cuckoo Filter based on error rate
- Parameters:
error_rate (float) –
filepath (str) – The path to the file to load or None if no file
hash_function (function) – Hashing strategy function to use hf(key)
- Returns:
A Cuckoo Filter object
- Return type:
- property max_swaps: int
The maximum number of swaps to perform
Note
Not settable
- Type:
int
- remove(key)[source]
Remove an element from the filter
- Parameters:
key (str) – Element to remove
- Return type:
bool
- property unique_elements: int
unique number of elements inserted
- Type:
int
Count-Min Sketches
Count-Min Sketches, and its derivatives, are good for estimating the number of occurrences of an element in streaming data while not needing to retain all the data elements. The result is a probabilistic count of elements inserted into the data structure. It will always provide the maximum number of times a data element was encountered. Notice that the result may be more than the true number of times it was inserted, but never fewer.
CountMinSketch
- class probables.CountMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]
Simple Count-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)
- Parameters:
width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Count-Min Sketch object
- Return type:
Note
- Initialization order of operations:
From file
Width and depth
Confidence and error rate
Note
Default query type is min
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens
- add(key, num_els=1)[source]
Insert the element key into the count-min sketch
- Parameters:
key (str) – The element to insert
num_els (int) – The number of times to insert the element
- Returns:
The number of times the element was likely inserted after the insertion
- Return type:
int
- add_alt(hashes, num_els=1)[source]
Insert an element by using the hash representation
- Parameters:
key (str) – The element to insert
num_els (int) – The number of times to insert the element
hashes (List[int]) –
- Returns:
The number of times the element was likely inserted after the insertion
- Return type:
int
- check(key)[source]
Check number of times element ‘key’ is in the count-min sketch
- Parameters:
key (str) – The key to check the number of times inserted
- Returns:
The number of times the element was likely inserted
- Return type:
int
- check_alt(hashes)[source]
Check the count-min sketch for an element by using the hash representation
- Parameters:
hashes (list) – The hashes representing the element to check
- Returns:
The number of times the element was likely inserted
- Return type:
int
- property confidence: float
The confidence of the count-min sketch
Note
Not settable
- Type:
float
- property depth: int
The depth of the count-min sketch
Note
Not settable
- Type:
int
- property elements_added: int
The number of elements added to the count-min sketch
Note
Not settable
- Type:
int
- property error_rate: float
The error rate of the count-min sketch
Note
Not settable
- Type:
float
- export(file)[source]
Export the count-min sketch to disk
- Parameters:
filename (str) – The filename to which the count-min sketch will be written.
file (Path | str | IOBase | mmap) –
- Return type:
None
- classmethod frombytes(b, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Count-Min Sketch
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A count-min sketch object
- Return type:
- hashes(key, depth=None)[source]
Return the hashes based on the provided key
- Parameters:
key (str) – Description of arg1
depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
- Returns:
A list of the hashes for the key in int form
- Return type:
List(int)
- join(second)[source]
Join two count-min sketchs into a single count-min sketch; the calling count-min sketch will have the resulting combined data
- Parameters:
second (CountMinSketch) – The count-min sketch to merge
- Raises:
TypeError – When second is not either a
CountMinSketch
,:class:CountMeanSketch orCountMeanMinSketch
CountMinSketchError – Raised when the count-min sketches are not of the same dimensions
- Return type:
None
Note
The calling count-min sketch will contain the combined data once complete
- property query_type: str
The name of the query type being used
Note
- Valid values:
‘min’ or None
‘mean’
‘mean-min’
- Type:
str
- remove(key, num_els=1)[source]
Remove element ‘key’ from the count-min sketch
- Parameters:
key (str) – The element to remove
num_els (int) – The number of times to remove the element
- Returns:
The number of times the element was likely inserted after the removal
- Return type:
int
- remove_alt(hashes, num_els=1)[source]
Remove an element by using the hash representation
- Parameters:
hashes (list) – The hashes representing the element to remove
num_els (int) – The number of times to remove the element
- Returns:
The number of times the element was likely inserted after the removal
- Return type:
int
- property width: int
The width of the count-min sketch
Note
Not settable
- Type:
int
CountMeanSketch
- class probables.CountMeanSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]
Simple Count-Mean Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)
- Parameters:
width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Count-Mean Sketch object
- Return type:
Note
- Initialization order of operations:
From file
Width and depth
Confidence and error rate
Note
Default query type is mean
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens
For more information of all methods and properties, see CountMinSketch.
CountMeanMinSketch
- class probables.CountMeanMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]
Simple Count-Mean-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)
- Parameters:
width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Count-Mean-Min Sketch object
- Return type:
Note
- Initialization order of operations:
From file
Width and depth
Confidence and error rate
Note
Default query type is mean-min
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens
For more information of all methods and properties, see CountMinSketch.
HeavyHitters
- class probables.HeavyHitters(num_hitters=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]
Find and track those elements that are the most common, or heavy hitters
- Parameters:
num_hitters (int) – The maximum number of distinct elements to track
width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Count-Min Sketch object
- Return type:
Note
- Initialization order of operations:
From file
Width and depth
Confidence and error rate
Note
Default query type is min
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens
- add(key, num_els=1)[source]
Add element to heavy hitters
- Parameters:
key (str) – The element to add
num_els (int) – The number of instances to add
- Returns:
Number of times key has been inserted
- Return type:
int
Note
Override function
- add_alt(key, hashes, num_els=1)[source]
Add the element key represented as hashes to the HeavyHitters object (hence the different signature on the function!)
- Parameters:
key (str) – The element to add
hashes (list) – The list of integers representing the key to insert
num_els (int) – The number of instances to add
- Returns:
Number of times key has been inserted
- Return type:
int
Note
Different key signature than the normal
CountMinSketch
Note
Override function
- classmethod frombytes(b, num_hitters=100, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Expanding Bloom Filter
num_hitters (int) – The maximum number of distinct elements to track
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- property heavy_hitters: Dict[str, int]
Return the heavy hitters, or most common elements
Note
Not settable
- Type:
dict
- join(second)[source]
Join is not supported by HeavyHitters
- Raises:
NotSupportedError – This functionality is currently not supported
- Parameters:
second (HeavyHitters) –
- property number_heavy_hitters: int
Return the maximum number of heavy hitters being tracked
Note
Not settable
- Type:
int
- remove_alt(hashes, num_els=1)[source]
Remove element based on hashes provided; not supported in heavy hitters
- Raises:
NotSupportedError – This function is not supported by the HeavyHitters class
- Parameters:
hashes (List[int]) –
num_els (int) –
Note
Override function
For more information of all methods and properties, see CountMinSketch.
StreamThreshold
- class probables.StreamThreshold(threshold=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]
Find and track those elements that are above a certain threshold
- Parameters:
threshold (int) – The threshold above which an element will be tracked
width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Count-Min Sketch object
- Return type:
Note
- Initialization order of operations:
From file
Width and depth
Confidence and error rate
Note
Default query type is min
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens
- add(key, num_els=1)[source]
Add the element for key into the data structure
- Parameters:
key (str) – The element to add
num_els (int) – The number of instances to add
- Returns:
Number of times key has been inserted
- Return type:
int
Note
Override function
- add_alt(key, hashes, num_els=1)[source]
Add the element for key into the data structure
- Parameters:
key (str) – The element to add
hashes (list) – The list of integers representing the key to insert
num_els (int) – The number of instances to add
- Returns:
Number of times key has been inserted
- Return type:
int
Note
Different key signature than the normal
CountMinSketch
Note
Override function
- classmethod frombytes(b, threshold=100, hash_function=None)[source]
- Parameters:
b (ByteString) – The bytes to load as a Expanding Bloom Filter
threshold (int) – The threshold above which an element will be tracked
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
A Bloom Filter object
- Return type:
- join(second)[source]
Join is not supported by StreamThreshold
- Raises:
NotSupportedError – This functionality is currently not supported
- Parameters:
second (StreamThreshold) –
- property meets_threshold: Dict[str, int]
Those keys that meet the required threshold (with value)
- Type:
dict
- remove(key, num_els=1)[source]
Remove element ‘key’ from the count-min sketch
- Parameters:
key (str) – The element to remove
num_els (int) – The number of times to remove the element
- Returns:
The number of times the element was likely inserted after the removal
- Return type:
int
Note
Override function
- remove_alt(key, hashes, num_els=1)[source]
Remove an element by using the hash representation
- Parameters:
key (str) – The key that the hashes represent
hashes (list) – The hashes representing the element to remove
num_els (int) – The number of times to remove the element
- Returns:
The number of times the element was likely inserted after the removal
- Return type:
int
Note
Different key signature than the normal
CountMinSketch
Note
Override function
- property threshold: int
The threshold at which a key is tracked
- Type:
int
For more information of all methods and properties, see CountMinSketch.
QuotientFilter
Quotient filters are an aproximate membership query filter (AMQ) that is both space efficient and returns a zero false negative rate and a probablistic false positive rate. Unlike Bloom filters, the quotient filter only requires a single hash of the element to insert. The upper q bits denote the location within the filter while the lower r bits are stored in the filter.
Quotient filters provide some useful benifits over Bloom filters including:
Merging of two filters (not union)
Resizing of the filter
Ability to remove elements
QuotientFilter
- class probables.QuotientFilter(quotient=20, hash_function=None)[source]
Simple Quotient Filter implementation
- Parameters:
quotient (int) – The size of the quotient to use
hash_function (function) – Hashing strategy function to use hf(key, number)
- Returns:
The initialized filter
- Return type:
- Raises:
ValueError –
Note
The size of the QuotientFilter will be 2**q
- add(key)[source]
Add key to the quotient filter
- Parameters:
key (str|bytes) – The element to add
- Return type:
None
- property bits_per_elm
The number of bits used per element
- Type:
int
- check(key)[source]
Check to see if key is likely in the quotient filter
- Parameters:
key (str|bytes) – The element to add
- Returns:
True if likely encountered, False if definately not
- Return type:
bool
- property elements_added: int
The number of elements added to the filter
- Type:
int
- property num_elements: int
The total size of the filter
- Type:
int
- property quotient: int
The size of the quotient, in bits
- Type:
int
- property remainder: int
The size of the remainder, in bits
- Type:
int
Utilities
Bitarray
- class probables.utilities.Bitarray(size)[source]
Simplified, pure python bitarray implementation using as little memory as possible
- Parameters:
size (int) – The number of bits in the bitarray
- Returns:
A bitarray
- Return type:
- Raises:
TypeError –
ValueError –
- as_string()[source]
String representation of the bitarray
- Returns:
Bitarray representation as a string
- Return type:
str
- property bitarray: array
The bitarray
- check_bit(idx)[source]
Check if the bit idx is set
- Parameters:
idx (int) – The index to check
- Returns:
The status of the bit, either 0 or 1
- Return type:
int
- clear_bit(idx)[source]
Set the bit at idx to 0
- Parameters:
idx (int) – The index to clear
- Return type:
None
- is_bit_set(idx)[source]
Check if the bit idx is set
- Parameters:
idx (int) – The index to check
- Returns:
The status of the bit, either 0 or 1
- Return type:
int
- num_bits_set()[source]
Number of bits set in the bitarray
- Returns:
Number of bits set
- Return type:
int
- set_bit(idx)[source]
Set the bit at idx to 1
- Parameters:
idx (int) – The index to set
- Return type:
None
- property size: int
The number of bits in the bitarray
- property size_bytes: int
The size of the bitarray in bytes
Exceptions
PyProbables Exceptions
- exception probables.exceptions.CountMinSketchError(message)[source]
CountMinSketch Exception
- Parameters:
message (str) – The error message to be reported
- Return type:
None
- exception probables.exceptions.CuckooFilterFullError(message)[source]
Cuckoo Filter Full Exception
- Parameters:
message (str) – The error message to be reported
- Return type:
None
- exception probables.exceptions.InitializationError(message)[source]
Initialization Exception
- Parameters:
message (str) – The initialization error messge
- Return type:
None
- exception probables.exceptions.NotSupportedError(message)[source]
Not Supported Functionality Exception
- Parameters:
message (str) – The error message to be reported
- Return type:
None
Hashing Functions
Probables Hashing Utilities
- probables.hashes.default_fnv_1a(key, depth=1)[source]
The default fnv-1a hashing routine
- Parameters:
key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute
- Returns:
List of size depth hashes
- Return type:
list(int)
- probables.hashes.default_md5(key, *args, **kwargs)[source]
The default md5 hashing routine
- Parameters:
key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute
- Returns:
List of 64-bit hashed representation of key hashes
- Return type:
list(int)
Note
Returns the upper-most 64 bits
- probables.hashes.default_sha256(key, *args, **kwargs)[source]
The default sha256 hashing routine
- Parameters:
key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute
- Returns:
List of 64-bit hashed representation of key hashes
- Return type:
list(int)
Note
Returns the upper-most 64 bits
- probables.hashes.fnv_1a(key, seed=0)[source]
Pure python implementation of the 64 bit fnv-1a hash
- Parameters:
key (str) – The element to be hashed
seed (int) – Add a seed to the initial starting point (0 means no seed)
- Returns:
64-bit hashed representation of key
- Return type:
int
Note
Uses the lower 64 bits when overflows occur
- probables.hashes.fnv_1a_32(key, seed=0)[source]
Pure python implementation of the 32 bit fnv-1a hash :param key: The element to be hashed :type key: str :param seed: Add a seed to the initial starting point (0 means no seed) :type seed: int
- Returns:
32-bit hashed representation of key
- Return type:
int
- Parameters:
key (str | bytes) –
seed (int) –
Note
Uses the lower 32 bits when overflows occur
- probables.hashes.hash_with_depth_bytes(func)[source]
Decorator to turns a function taking a single key and hashes it to bytes. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.
- Parameters:
key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute
func (Callable[[str | bytes, int], bytes]) –
- Returns:
64-bit hashed representation of key
- Return type:
list(int)
Note
Arguments shown are as it will be after decorated
- probables.hashes.hash_with_depth_int(func)[source]
Decorator to turn a function that takes a single key and hashes it to an int. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.
- Parameters:
key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute
func (Callable[[str | bytes, int], List[int]]) –
- Returns:
64-bit hashed representation of key
- Return type:
list(int)
Note
Arguments shown are as it will be after decorated