pyprobables API

Here you can find the full developer API for the pyprobables project. pyprobables provides a suite of probabilistic data-structures to be used in data analytics and data science projects.

Data Structures and Classes

Bloom Filters

Bloom Filters are a class of probabilistic data structures used for set operations. Bloom Filters guarantee a zero percent false negative rate and a predetermined false positive rate. Once the number of elements inserted exceeds the estimated elements, the false positive rate will increase over the desired amount.

Further Reading

BloomFilter

class probables.BloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]

Simple Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)

Parameters:

est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
filepath (str) – Path to file to load
hex_string (str) – Hex based representation to be loaded
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

BloomFilter

Note

Initialization order of operations:

From file
From Hex String
From params

add(key)[source]

Add the key to the Bloom Filter

Parameters:: key (str) – The element to be inserted
Return type:: None

add_alt(hashes)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:: hashes (list) – A list of integers representing the key to insert
Return type:: None

property bloom: array

The bit/int array

Type:: list(int)

property bloom_length: int

Length of the Bloom Filter array

Note

Not settable

Type:: int

check(key)[source]

Check if the key is likely in the Bloom Filter

Parameters:: key (str) – The element to be checked
Returns:: True if likely encountered, False if definately not
Return type:: bool

check_alt(hashes)[source]

Check if the element represented by hashes is in the Bloom Filter

Parameters:: hashes (list) – A list of integers representing the key to check
Returns:: True if likely encountered, False if definately not
Return type:: bool

clear()[source]

Clear or reset the Counting Bloom Filter

Return type:: None

current_false_positive_rate()[source]

Calculate the current false positive rate based on elements added

Returns:: The current false positive rate
Return type:: float

property elements_added: int

Number of elements added to the Bloom Filter

Note

Changing this can cause the current false positive rate to be reported incorrectly

Type:: int

estimate_elements()[source]

Estimate the number of unique elements added

Returns:: Number of elements estimated to be inserted
Return type:: int

Note

Returns -1 if all bits in the Bloom filter are set

property estimated_elements: int

The maximum number of elements estimated to be added at setup

Note

Not settable

Type:: int

export(file)[source]

Export the Bloom Filter to disk

Parameters:: file (str) – The file or filepath to which the Bloom Filter will be written.
Return type:: None

export_c_header(filename)[source]

Export the Bloom Filter to disk as a C header file.

Parameters:: filename (str) – The filename to which the Bloom Filter will be written.
Return type:: None

export_hex()[source]

Export the Bloom Filter as a hex string

Returns:: Hex representation of the Bloom Filter
Return type:: str

export_size()[source]

Calculate the size of the bloom on disk

Returns:: Size of the Bloom Filter when exported to disk
Return type:: int

property false_positive_rate: float

The maximum desired false positive rate

Note

Not settable

Type:: float

classmethod frombytes(b, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Bloom Filter
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

BloomFilter

property hash_function: Callable[[str | bytes, int], list[int]]

The hash function used

Note

Not settable

Type:: function

hashes(key, depth=None)[source]

Return the hashes based on the provided key

Parameters:

key (str) – Description of arg1
depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes

Returns:

A list of the hashes for the key in int form

Return type:

List(int)

intersection(second)[source]

Return a new Bloom Filter that contains the intersection of the two

Parameters:

second (BloomFilter) – The Bloom Filter with which to take the intersection

Returns:

The new Bloom Filter containing the intersection

Return type:

BloomFilter

Raises:

TypeError – When second is not either a BloomFilter or BloomFilterOnDisk
SimilarityError – When second is not of the same size (false_positive_rate and est_elements)

Note

second may be a BloomFilterOnDisk object

property is_on_disk: bool

Is the Bloom Filter on Disk or not

Note

Not settable

Type:: bool

jaccard_index(second)[source]

Calculate the jaccard similarity score between two Bloom Filters

Parameters:

second (BloomFilter) – The Bloom Filter to compare with

Returns:

A numeric value between 0 and 1 where 1 is identical and 0 means completely different

Return type:

float

Raises:

TypeError – When second is not either a BloomFilter or BloomFilterOnDisk
SimilarityError – When second is not of the same size (false_positive_rate and est_elements)

Note

second may be a BloomFilterOnDisk object

property number_bits: int

Number of bits in the Bloom Filter

Note

Not settable

Type:: int

property number_hashes: int

The number of hashes required for the Bloom Filter hashing strategy

Note

Not settable

Type:: int

union(second)[source]

Return a new Bloom Filter that contains the union of the two

Parameters:

second (BloomFilter) – The Bloom Filter with which to calculate the union

Returns:

The new Bloom Filter containing the union

Return type:

BloomFilter

Raises:

TypeError – When second is not either a BloomFilter or BloomFilterOnDisk
SimilarityError – When second is not of the same size (false_positive_rate and est_elements)

Note

second may be a BloomFilterOnDisk object

BloomFilterOnDisk

class probables.BloomFilterOnDisk(filepath, est_elements=None, false_positive_rate=None, hex_string=None, hash_function=None)[source]

Simple Bloom Filter implementation directly on disk for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)

Parameters:

filepath (str) – Path to file to load
est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
hex_string (str) – Hex based representation to be loaded
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

BloomFilterOnDisk

Raises:

NotSupportedError – Loading using a hex string is not supported

Note

Initialization order of operations:

Esimated elements and false positive rate
From Hex String
Only filepath provided

add_alt(hashes)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:: hashes (list) – A list of integers representing the key to insert
Return type:: None

close()[source]

Clean up the BloomFilterOnDisk object

Return type:: None

export(file)[source]

Export to disk if a different location

Parameters:: file (str|Path) – The filename to which the Bloom Filter will be exported
Return type:: None

Note

Only exported if the filename is not the original filename

classmethod frombytes(b, hash_function=None)[source]

Raises: NotSupportedError

Parameters:

b (ByteString)
hash_function (Callable[[str | bytes, int], list[int]] | None)

Return type:

BloomFilterOnDisk

For more information of all methods and properties, see BloomFilter.

ExpandingBloomFilter

class probables.ExpandingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hash_function=None)[source]

Simple expanding Bloom Filter implementation for use in python; the Bloom Fiter will automatically expand, or grow, if the false positive rate is about to become greater than the desired false positive rate.

Parameters:

est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

An expanding Bloom Filter object

Return type:

ExpandingBloomFilter

Note

Initialization order of operations:

Filepath
est_elements and false_positive_rate

add(key, force=False)[source]

Add the key to the Bloom Filter

Parameters:

key (str) – The element to be inserted
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

add_alt(hashes, force=False)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to insert
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

check(key)[source]

Check to see if the key is in the Bloom Filter

Parameters:: key (str) – The key to check for in the Bloom Filter
Returns:: True if the element is likely present; False if definately not present
Return type:: bool

check_alt(hashes)[source]

Check to see if the hashes are in the Bloom Filter

Parameters:: hashes (list) – The hash representation to check for in the Bloom Filter
Returns:: True if the element is likely present; False if definately not present
Return type:: bool

property elements_added: int

The total number of elements added

Type:: int

property estimated_elements: int

The original number of elements estimated to be in the Bloom Filter

Type:: int

property expansions: int

The number of expansions

Type:: int

export(file)[source]

Export an expanding Bloom Filter, or subclass, to disk

Parameters:

filepath (str) – The path to the file to import
file (Path | str | IOBase | mmap)

Return type:

None

property false_positive_rate: float

The desired false positive rate of the expanding Bloom Filter

Type:: float

classmethod frombytes(b, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Expanding Bloom Filter
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

ExpandingBloomFilter

property hash_function: Callable[[str | bytes, int], list[int]]

The total number of elements added

Type:: int

push()[source]

Push a new expansion onto the Bloom Filter

Return type:: None

RotatingBloomFilter

class probables.RotatingBloomFilter(est_elements=None, false_positive_rate=None, max_queue_size=10, filepath=None, hash_function=None)[source]

Simple Rotating Bloom Filter implementation that allows for the “older” elements added to be removed, in chunks. As the queue fills up, those elements inserted earlier will be bulk removed. This also provides the user with the oportunity to force the removal instead of it being time based.

Parameters:

est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)

Note

Initialization order of operations:

Filepath
est_elements and false_positive_rate

add(key, force=False)

Add the key to the Bloom Filter

Parameters:

key (str) – The element to be inserted
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

add_alt(hashes, force=False)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to insert
force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

check(key)

Check to see if the key is in the Bloom Filter

Parameters:: key (str) – The key to check for in the Bloom Filter
Returns:: True if the element is likely present; False if definately not present
Return type:: bool

check_alt(hashes)

Check to see if the hashes are in the Bloom Filter

Parameters:: hashes (list) – The hash representation to check for in the Bloom Filter
Returns:: True if the element is likely present; False if definately not present
Return type:: bool

property current_queue_size: int

The current size of the queue

Type:: int

property elements_added: int

The total number of elements added

Type:: int

property estimated_elements: int

The original number of elements estimated to be in the Bloom Filter

Type:: int

property expansions: int

The number of expansions

Type:: int

export(file)

Export an expanding Bloom Filter, or subclass, to disk

Parameters:

filepath (str) – The path to the file to import
file (Path | str | IOBase | mmap)

Return type:

None

property false_positive_rate: float

The desired false positive rate of the expanding Bloom Filter

Type:: float

classmethod frombytes(b, max_queue_size, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Expanding Bloom Filter
max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

RotatingBloomFilter

property hash_function: Callable[[str | bytes, int], list[int]]

The total number of elements added

Type:: int

property max_queue_size: int

The maximum size for the queue

Type:: int

pop()[source]

Pop the oldest Bloom Filter off of the queue without pushing a new Bloom Filter onto the queue

Raises:: RotatingBloomFilterError – Unable to rotate the Bloom Filter
Return type:: None

push()[source]

Push a new bloom filter onto the queue and rotate if necessary

Return type:: None

CountingBloomFilter

class probables.CountingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]

Simple Counting Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/counting_bloom)

Parameters:

est_elements (int) – The number of estimated elements to be added
false_positive_rate (float) – The desired false positive rate
filepath (str) – Path to file to load
hex_string (str) – Hex based representation to be loaded
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Counting Bloom Filter object

Return type:

CountingBloomFilter

Note

Initialization order of operations:

From file
From Hex String
From params

add(key, num_els=1)[source]

Add the key to the Counting Bloom Filter

Parameters:

key (str) – The element to be inserted
num_els (int) – Number of times to insert the element

Returns:

Maximum number of insertions

Return type:

int

add_alt(hashes, num_els=1)[source]

Add the element represented by hashes into the Counting Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to insert
num_els (int) – Number of times to insert the element

Returns:

Maximum number of insertions

Return type:

int

property bloom: array

The bit/int array

Type:: list(int)

property bloom_length: int

Length of the Bloom Filter array

Note

Not settable

Type:: int

check(key)[source]

Check if the key is likely in the Counting Bloom Filter

Parameters:: key (str) – The element to be checked
Returns:: Maximum number of insertions
Return type:: int

check_alt(hashes)[source]

Check if the element represented by hashes is in the Counting Bloom Filter

Parameters:: hashes (list) – A list of integers representing the key to check
Returns:: Maximum number of insertions
Return type:: int

clear()

Clear or reset the Counting Bloom Filter

Return type:: None

current_false_positive_rate()

Calculate the current false positive rate based on elements added

Returns:: The current false positive rate
Return type:: float

property elements_added: int

Number of elements added to the Bloom Filter

Note

Changing this can cause the current false positive rate to be reported incorrectly

Type:: int

estimate_elements()

Estimate the number of unique elements added

Returns:: Number of elements estimated to be inserted
Return type:: int

Note

Returns -1 if all bits in the Bloom filter are set

property estimated_elements: int

The maximum number of elements estimated to be added at setup

Note

Not settable

Type:: int

export(file)

Export the Bloom Filter to disk

Parameters:: file (str) – The file or filepath to which the Bloom Filter will be written.
Return type:: None

export_c_header(filename)

Export the Bloom Filter to disk as a C header file.

Parameters:: filename (str) – The filename to which the Bloom Filter will be written.
Return type:: None

export_hex()

Export the Bloom Filter as a hex string

Returns:: Hex representation of the Bloom Filter
Return type:: str

export_size()

Calculate the size of the bloom on disk

Returns:: Size of the Bloom Filter when exported to disk
Return type:: int

property false_positive_rate: float

The maximum desired false positive rate

Note

Not settable

Type:: float

classmethod frombytes(b, hash_function=None)[source]

Parameters:

b (ByteString) – the bytes to load as a Counting Bloom Filter
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Counting Bloom Filter object

Return type:

CountingBloomFilter

property hash_function: Callable[[str | bytes, int], list[int]]

The hash function used

Note

Not settable

Type:: function

hashes(key, depth=None)

Return the hashes based on the provided key

Parameters:

key (str) – Description of arg1
depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes

Returns:

A list of the hashes for the key in int form

Return type:

List(int)

intersection(second)[source]

Take the intersection of two Counting Bloom Filters

Parameters:

second (CountingBloomFilter) – The Bloom Filter with which to take the intersection

Returns:

The new Counting Bloom Filter containing the union

Return type:

CountingBloomFilter

Raises:

TypeError – When second is not a CountingBloomFilter
SimilarityError – When second is not of the same size (false_positive_rate and est_elements)

Note

The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()

property is_on_disk: bool

Is the Bloom Filter on Disk or not

Note

Not settable

Type:: bool

jaccard_index(second)[source]

Take the Jaccard Index of two Counting Bloom Filters

Parameters:

second (CountingBloomFilter) – The Bloom Filter with which to take the jaccard index

Returns:

A numeric value between 0 and 1 where 1 is identical and 0 means completely different

Return type:

float

Raises:

TypeError – When second is not a CountingBloomFilter
SimilarityError – When second is not of the same size (false_positive_rate and est_elements)

Note

The Jaccard Index is based on the unique set of elements added and not the number of each element added

property number_bits: int

Number of bits in the Bloom Filter

Note

Not settable

Type:: int

property number_hashes: int

The number of hashes required for the Bloom Filter hashing strategy

Note

Not settable

Type:: int

remove(key, num_els=1)[source]

Remove the element from the counting bloom

Parameters:

key (str) – The element to be removed
num_els (int) – Number of times to remove the element

Returns:

Maximum number of insertions after the removal

Return type:

int

remove_alt(hashes, num_els=1)[source]

Remvoe the element represented by hashes from the Counting Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to remove
num_els (int) – Number of times to remove the element

Returns:

Maximum number of insertions after the removal

Return type:

int

union(second)[source]

Return a new Countiong Bloom Filter that contains the union of the two

Parameters:

second (CountingBloomFilter) – The Counting Bloom Filter with which to calculate the union

Returns:

The new Counting Bloom Filter containing the union

Return type:

CountingBloomFilter

Raises:

TypeError – When second is not a CountingBloomFilter
SimilarityError – When second is not of the same size (false_positive_rate and est_elements)

Note

The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()

Cuckoo Filters

Cuckoo filters are a space efficient data structure that supports set membership testing. Cuckoo filters support insertion, deletion, and lookup of elements with low overhead and few false positive results. The name is derived from the cuckoo hashing strategy used to resolve conflicts.

Further Reading

CuckooFilter

class probables.CuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]

Simple Cuckoo Filter implementation

Parameters:

capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
filepath (str) – The path to the file to load or None if no file
hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

add(key)[source]

Add element key to the filter

Parameters:: key (str) – The element to add
Raises:: CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’

property auto_expand: bool

True if the cuckoo filter will expand automatically

Type:: bool

property bucket_size: int

The number of buckets per bin

Note

Not settable

Type:: int

property buckets: list[list[int]]

The buckets holding the fingerprints

Note

Not settable

Type:: list(list)

property capacity: int

The number of bins

Note

Not settable

Type:: int

check(key)[source]

Check if an element is in the filter

Parameters:: key (str) – Element to check
Returns:: True if likely present, False if definately not
Return type:: bool

property elements_added: int

The number of elements added

Note

Not settable

Type:: int

property error_rate: float

The error rate of the cuckoo filter

Type:: float

expand()[source]: Expand the cuckoo filter

property expansion_rate: int

The rate at expansion when the filter grows

Type:: int

export(file)[source]

Export cuckoo filter to file

Parameters:: file (Path | str | IOBase | mmap) – Path to file to export
Return type:: None

property fingerprint_size: int

The size in bytes of the fingerprint

Raises:: ValueError – If the size is not between 1 and 4

Note

The size of the fingerprint must be between 1 and 4

Type:: int

property fingerprint_size_bits: int

The size in bits of the fingerprint

Type:: int

classmethod frombytes(b, error_rate=None, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Expanding Bloom Filter
error_rate (float) – The error rate of the cuckoo filter, if used to generate the original filter
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

CuckooFilter

classmethod init_error_rate(error_rate, capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, hash_function=None)[source]

Initialize a simple Cuckoo Filter based on error rate

Parameters:

error_rate (float)
capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

classmethod load_error_rate(error_rate, filepath, hash_function=None)[source]

Initialize a previously exported Cuckoo Filter based on error rate

Parameters:

error_rate (float)
filepath (str) – The path to the file to load or None if no file
hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

load_factor()[source]

float: How full the Cuckoo Filter is currently

Return type:: float

property max_swaps: int

The maximum number of swaps to perform

Note

Not settable

Type:: int

remove(key)[source]

Remove an element from the filter

Parameters:: key (str) – Element to remove
Returns:: True if removed, False if not present
Return type:: bool

CountingCuckooFilter

class probables.CountingCuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]

Simple Counting Cuckoo Filter implementation

Parameters:

capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
filename (str) – The path to the file to load or None if no file
filepath (str | Path | None)
hash_function (Callable[[str | bytes, int], int] | None)

Returns:

A Cuckoo Filter object

Return type:

CountingCuckooFilter

add(key)[source]

Add element key to the filter

Parameters:: key (str) – The element to add
Raises:: CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’
Return type:: None

property auto_expand: bool

True if the cuckoo filter will expand automatically

Type:: bool

property bucket_size: int

The number of buckets per bin

Note

Not settable

Type:: int

property buckets: list[list[CountingCuckooBin]]

The buckets holding the fingerprints

Note

Not settable

Type:: list(list)

property capacity: int

The number of bins

Note

Not settable

Type:: int

check(key)[source]

Check if an element is in the filter

Parameters:: key (str) – Element to check
Returns:: The number of times inserted into the filter
Return type:: int

property elements_added: int

The number of elements added

Note

Not settable

Type:: int

property error_rate: float

The error rate of the cuckoo filter

Type:: float

expand()[source]: Expand the cuckoo filter

property expansion_rate: int

The rate at expansion when the filter grows

Type:: int

export(file)[source]

Export cuckoo filter to file

Parameters:: file (str) – Path to file to export
Return type:: None

property fingerprint_size: int

The size in bytes of the fingerprint

Raises:: ValueError – If the size is not between 1 and 4

Note

The size of the fingerprint must be between 1 and 4

Type:: int

property fingerprint_size_bits: int

The size in bits of the fingerprint

Type:: int

classmethod frombytes(b, error_rate=None, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Expanding Bloom Filter
error_rate (float) – The error rate of the cuckoo filter, if used to generate the original filter
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

CountingCuckooFilter

classmethod init_error_rate(error_rate, capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, hash_function=None)[source]

Initialize a simple Cuckoo Filter based on error rate

Parameters:

error_rate (float)
capacity (int) – The number of bins
bucket_size (int) – The number of buckets per bin
max_swaps (int) – The number of cuckoo swaps before stopping
expansion_rate (int) – The rate at which to expand
auto_expand (bool) – If the filter should automatically expand
hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

classmethod load_error_rate(error_rate, filepath, hash_function=None)[source]

Initialize a previously exported Cuckoo Filter based on error rate

Parameters:

error_rate (float)
filepath (str) – The path to the file to load or None if no file
hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

load_factor()[source]

float: How full the Cuckoo Filter is currently

Return type:: float

property max_swaps: int

The maximum number of swaps to perform

Note

Not settable

Type:: int

remove(key)[source]

Remove an element from the filter

Parameters:: key (str) – Element to remove
Return type:: bool

property unique_elements: int

unique number of elements inserted

Type:: int

Count-Min Sketches

Count-Min Sketches, and its derivatives, are good for estimating the number of occurrences of an element in streaming data while not needing to retain all the data elements. The result is a probabilistic count of elements inserted into the data structure. It will always provide the maximum number of times a data element was encountered. Notice that the result may be more than the true number of times it was inserted, but never fewer.

Further Reading

CountMinSketch

class probables.CountMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:

width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Min Sketch object

Return type:

CountMinSketch

Note

Initialization order of operations:

From file
Width and depth
Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Insert the element key into the count-min sketch

Parameters:

key (str) – The element to insert
num_els (int) – The number of times to insert the element

Returns:

The number of times the element was likely inserted after the insertion

Return type:

int

add_alt(hashes, num_els=1)[source]

Insert an element by using the hash representation

Parameters:

key (str) – The element to insert
num_els (int) – The number of times to insert the element
hashes (list[int])

Returns:

The number of times the element was likely inserted after the insertion

Return type:

int

check(key)[source]

Check number of times element ‘key’ is in the count-min sketch

Parameters:: key (str) – The key to check the number of times inserted
Returns:: The number of times the element was likely inserted
Return type:: int

check_alt(hashes)[source]

Check the count-min sketch for an element by using the hash representation

Parameters:: hashes (list) – The hashes representing the element to check
Returns:: The number of times the element was likely inserted
Return type:: int

clear()[source]

Reset the count-min sketch to an empty state

Return type:: None

property confidence: float

The confidence of the count-min sketch

Note

Not settable

Type:: float

property depth: int

The depth of the count-min sketch

Note

Not settable

Type:: int

property elements_added: int

The number of elements added to the count-min sketch

Note

Not settable

Type:: int

property error_rate: float

The error rate of the count-min sketch

Note

Not settable

Type:: float

export(file)[source]

Export the count-min sketch to disk

Parameters:

filename (str) – The filename to which the count-min sketch will be written.
file (Path | str | IOBase | mmap)

Return type:

None

classmethod frombytes(b, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Count-Min Sketch
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A count-min sketch object

Return type:

CountMinSketch

hashes(key, depth=None)[source]

Return the hashes based on the provided key

Parameters:

key (str) – Description of arg1
depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes

Returns:

A list of the hashes for the key in int form

Return type:

List(int)

join(second)[source]

Join two count-min sketchs into a single count-min sketch; the calling count-min sketch will have the resulting combined data

Parameters:

second (CountMinSketch) – The count-min sketch to merge

Raises:

TypeError – When second is not either a CountMinSketch,:class:CountMeanSketch or CountMeanMinSketch
CountMinSketchError – Raised when the count-min sketches are not of the same dimensions

Return type:

None

Note

The calling count-min sketch will contain the combined data once complete

property query_type: str

The name of the query type being used

Note

Valid values:

‘min’ or None
‘mean’
‘mean-min’

Type:: str

remove(key, num_els=1)[source]

Remove element ‘key’ from the count-min sketch

Parameters:

key (str) – The element to remove
num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

remove_alt(hashes, num_els=1)[source]

Remove an element by using the hash representation

Parameters:

hashes (list) – The hashes representing the element to remove
num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

property width: int

The width of the count-min sketch

Note

Not settable

Type:: int

CountMeanSketch

class probables.CountMeanSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Mean Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:

width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Mean Sketch object

Return type:

CountMeanSketch

Note

Initialization order of operations:

From file
Width and depth
Confidence and error rate

Note

Default query type is mean

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

For more information of all methods and properties, see CountMinSketch.

CountMeanMinSketch

class probables.CountMeanMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Mean-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:

width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Mean-Min Sketch object

Return type:

CountMeanMinSketch

Note

Initialization order of operations:

From file
Width and depth
Confidence and error rate

Note

Default query type is mean-min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

For more information of all methods and properties, see CountMinSketch.

HeavyHitters

class probables.HeavyHitters(num_hitters=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Find and track those elements that are the most common, or heavy hitters

Parameters:

num_hitters (int) – The maximum number of distinct elements to track
width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Min Sketch object

Return type:

HeavyHitters

Note

Initialization order of operations:

From file
Width and depth
Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Add element to heavy hitters

Parameters:

key (str) – The element to add
num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Override function

add_alt(key, hashes, num_els=1)[source]

Add the element key represented as hashes to the HeavyHitters object (hence the different signature on the function!)

Parameters:

key (str) – The element to add
hashes (list) – The list of integers representing the key to insert
num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

clear()[source]

Clear out the heavy hitters!

Return type:: None

classmethod frombytes(b, num_hitters=100, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Expanding Bloom Filter
num_hitters (int) – The maximum number of distinct elements to track
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

HeavyHitters

property heavy_hitters: dict[str, int]

Return the heavy hitters, or most common elements

Note

Not settable

Type:: dict

join(second)[source]

Join is not supported by HeavyHitters

Raises:: NotSupportedError – This functionality is currently not supported
Parameters:: second (HeavyHitters)

property number_heavy_hitters: int

Return the maximum number of heavy hitters being tracked

Note

Not settable

Type:: int

remove_alt(hashes, num_els=1)[source]

Remove element based on hashes provided; not supported in heavy hitters

Raises:

NotSupportedError – This function is not supported by the HeavyHitters class

Parameters:

hashes (list[int])
num_els (int)

Note

Override function

For more information of all methods and properties, see CountMinSketch.

StreamThreshold

class probables.StreamThreshold(threshold=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Find and track those elements that are above a certain threshold

Parameters:

threshold (int) – The threshold above which an element will be tracked
width (int) – The width of the count-min sketch
depth (int) – The depth of the count-min sketch
confidence (float) – The level of confidence desired
error_rate (float) – The desired error rate
filepath (str) – Path to file to load
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Min Sketch object

Return type:

StreamThreshold

Note

Initialization order of operations:

From file
Width and depth
Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Add the element for key into the data structure

Parameters:

key (str) – The element to add
num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Override function

add_alt(key, hashes, num_els=1)[source]

Add the element for key into the data structure

Parameters:

key (str) – The element to add
hashes (list) – The list of integers representing the key to insert
num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

clear()[source]

Clear out the stream threshold!

Return type:: None

classmethod frombytes(b, threshold=100, hash_function=None)[source]

Parameters:

b (ByteString) – The bytes to load as a Expanding Bloom Filter
threshold (int) – The threshold above which an element will be tracked
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

StreamThreshold

join(second)[source]

Join is not supported by StreamThreshold

Raises:: NotSupportedError – This functionality is currently not supported
Parameters:: second (StreamThreshold)

property meets_threshold: dict[str, int]

Those keys that meet the required threshold (with value)

Type:: dict

remove(key, num_els=1)[source]

Remove element ‘key’ from the count-min sketch

Parameters:

key (str) – The element to remove
num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

Note

Override function

remove_alt(key, hashes, num_els=1)[source]

Remove an element by using the hash representation

Parameters:

key (str) – The key that the hashes represent
hashes (list) – The hashes representing the element to remove
num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

property threshold: int

The threshold at which a key is tracked

Type:: int

For more information of all methods and properties, see CountMinSketch.

QuotientFilter

Quotient filters are an aproximate membership query filter (AMQ) that is both space efficient and returns a zero false negative rate and a probablistic false positive rate. Unlike Bloom filters, the quotient filter only requires a single hash of the element to insert. The upper q bits denote the location within the filter while the lower r bits are stored in the filter.

Quotient filters provide some useful benifits over Bloom filters including:

Merging of two filters (not union)
Resizing of the filter
Ability to remove elements

Further Reading

QuotientFilter

class probables.QuotientFilter(quotient=20, auto_expand=True, hash_function=None)[source]

Simple Quotient Filter implementation

Parameters:

quotient (int) – The size of the quotient to use
auto_expand (bool) – Automatically expand or not
hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

The initialized filter

Return type:

QuotientFilter

Raises:

QuotientFilterError – Raised when unable to initialize

Note

The size of the QuotientFilter will be 2**q

add(key)[source]

Add key to the quotient filter

Parameters:: key (str|bytes) – The element to add
Raises:: QuotientFilterError – Raised when no locations are available in which to insert
Return type:: None

add_alt(_hash)[source]

Add the pre-hashed value to the quotient filter

Parameters:: _hash (int) – The element to add
Raises:: QuotientFilterError – Raised when no locations are available in which to insert
Return type:: None

property auto_expand: bool

Will the quotient filter automatically expand

Type:: bool

property bits_per_elm: int

The number of bits used per element

Type:: int

check(key)[source]

Check to see if key is likely in the quotient filter

Parameters:: key (str|bytes) – The element to add
Returns:: True if likely encountered, False if definately not
Return type:: bool

check_alt(_hash)[source]

Check to see if the pre-calculated hash is likely in the quotient filter

Parameters:: _hash (int) – The element to add
Returns:: True if likely encountered, False if definately not
Return type:: bool

property elements_added: int

The number of elements added to the filter

Type:: int

get_hashes()[source]

Get the hashes from the quotient filter as a list

Returns:: The hash values stored in the quotient filter
Return type:: list(int)

hashes()[source]

A generator over the hashes in the quotient filter

Yields:: int – The next hash stored in the quotient filter
Return type:: Iterator[int]

property load_factor: float

The load factor of the filter

Type:: float

property max_load_factor: float

The maximum allowed load factor after which auto expanding should occur

Type:: float

merge(second)[source]

Merge the second quotient filter into the first

Parameters:: second (QuotientFilter) – The quotient filter to merge
Return type:: None

Note

The hashing function between the two filters should match

Note

Errors can occur if the quotient filter being inserted into does not expand (i.e., auto_expand=False)

property num_elements: int

The total size of the filter

Type:: int

print(file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

show the bits and the run/cluster/continuation/empty status, defaults to sys.stdout

Parameters:: file (TextIO)

property quotient: int

The size of the quotient, in bits

Type:: int

property remainder: int

The size of the remainder, in bits

Type:: int

remove(key)[source]

Remove key from the quotient filter

Parameters:: key (str|bytes) – The element to remove
Return type:: None

remove_alt(_hash)[source]

Remove key from the quotient filter

Parameters:: _hash (int) – The element to remove
Return type:: None

resize(quotient=None)[source]

Resize the quotient filter to use the new quotient size

Parameters:: quotient (int) – The new quotient to use
Return type:: None

Note

If None is provided, the quotient filter will double in size (quotient + 1)

Raises:: QuotientFilterError – When the new quotient will not accommodate the elements already added
Parameters:: quotient (int | None)
Return type:: None

property size: int

The number of bins available in the filter

Note

same as num_elements

Type:: int

validate_metadata(verbose=False)[source]

Check for invalid bit settings, per the wikipedia documentation

Parameters:: verbose (str) – Print which index in the quotient filter failed
Returns:: True if the metadata bits are all passing; False otherwise
Return type:: bool

Utilities

Bitarray

class probables.utilities.Bitarray(size)[source]

Simplified, pure python bitarray implementation using as little memory as possible

Parameters:

size (int) – The number of bits in the bitarray

Returns:

A bitarray

Return type:

Bitarray

Raises:

TypeError –
ValueError –

as_string()[source]

String representation of the bitarray

Returns:: Bitarray representation as a string
Return type:: str

property bitarray: array: The bitarray

check_bit(idx)[source]

Check if the bit idx is set

Parameters:: idx (int) – The index to check
Returns:: The status of the bit, either 0 or 1
Return type:: int

clear()[source]: Clear all bits in the bitarray

clear_bit(idx)[source]

Set the bit at idx to 0

Parameters:: idx (int) – The index to clear
Return type:: None

export(file)[source]

Export the bitarray to a file

Parameters:

filename (str) – Filename to export to
file (Path | str | IOBase | mmap)

Return type:

None

classmethod from_bytes(data)[source]

Convert bytes to a bitarray

Parameters:: data (bytes) – Bytes to convert
Return type:: Bitarray

is_bit_set(idx)[source]

Check if the bit idx is set

Parameters:: idx (int) – The index to check
Returns:: The status of the bit, either 0 or 1
Return type:: int

num_bits_set()[source]

Number of bits set in the bitarray

Returns:: Number of bits set
Return type:: int

set_bit(idx)[source]

Set the bit at idx to 1

Parameters:: idx (int) – The index to set
Return type:: None

property size: int: The number of bits in the bitarray

property size_bytes: int: The size of the bitarray in bytes

to_bytes()[source]

Convert the bitarray to bytes

Returns:: Bitarray representation as bytes
Return type:: bytes

Exceptions

PyProbables Exceptions

exception probables.exceptions.CountMinSketchError(message)[source]

CountMinSketch Exception

Parameters:: message (str) – The error message to be reported
Return type:: None

exception probables.exceptions.CuckooFilterFullError(message)[source]

Cuckoo Filter Full Exception

Parameters:: message (str) – The error message to be reported
Return type:: None

exception probables.exceptions.InitializationError(message)[source]

Initialization Exception

Parameters:: message (str) – The initialization error messge
Return type:: None

exception probables.exceptions.NotSupportedError(message)[source]

Not Supported Functionality Exception

Parameters:: message (str) – The error message to be reported
Return type:: None

exception probables.exceptions.ProbablesBaseException(message)[source]

Base ProbablesBaseException

Parameters:: message (str) – The error message to be reported
Return type:: None

exception probables.exceptions.QuotientFilterError(message)[source]

Quotient Filter Exception

Parameters:: message (str) – The error message to be reported
Return type:: None

exception probables.exceptions.RotatingBloomFilterError(message)[source]

RotatingBloomFilter unable to rotate Blooms Exceptions

Parameters:: message (str) – The error message to be reported
Return type:: None

exception probables.exceptions.SimilarityError(message)[source]

Similarity Exception

Parameters:: message (str) – The error message to be reported
Return type:: None

Hashing Functions

Probables Hashing Utilities

probables.hashes.default_fnv_1a(key, depth=1)[source]

The default fnv-1a hashing routine

Parameters:

key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute

Returns:

List of size depth hashes

Return type:

list(int)

probables.hashes.default_md5(key, *args, **kwargs)[source]

The default md5 hashing routine

Parameters:

key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute

Returns:

List of 64-bit hashed representation of key hashes

Return type:

list(int)

Note

Returns the upper-most 64 bits

probables.hashes.default_sha256(key, *args, **kwargs)[source]

The default sha256 hashing routine

Parameters:

key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute

Returns:

List of 64-bit hashed representation of key hashes

Return type:

list(int)

Note

Returns the upper-most 64 bits

probables.hashes.fnv_1a(key, seed=0)[source]

Pure python implementation of the 64 bit fnv-1a hash

Parameters:

key (str) – The element to be hashed
seed (int) – Add a seed to the initial starting point (0 means no seed)

Returns:

64-bit hashed representation of key

Return type:

int

Note

Uses the lower 64 bits when overflows occur

probables.hashes.fnv_1a_32(key, seed=0)[source]

Pure python implementation of the 32 bit fnv-1a hash :param key: The element to be hashed :type key: str :param seed: Add a seed to the initial starting point (0 means no seed) :type seed: int

Returns:

32-bit hashed representation of key

Return type:

int

Parameters:

key (str | bytes)
seed (int)

Note

Uses the lower 32 bits when overflows occur

probables.hashes.hash_with_depth_bytes(func)[source]

Decorator to turns a function taking a single key and hashes it to bytes. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.

Parameters:

key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute
func (Callable[[str | bytes, int], bytes])

Returns:

64-bit hashed representation of key

Return type:

list(int)

Note

Arguments shown are as it will be after decorated

probables.hashes.hash_with_depth_int(func)[source]

Decorator to turn a function that takes a single key and hashes it to an int. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.

Parameters:

key (str) – The element to be hashed
depth (int) – The number of hash permutations to compute
func (Callable[[str | bytes, int], list[int]] | Callable[[str | bytes, int], int])

Returns:

64-bit hashed representation of key

Return type:

list(int)

Note

Arguments shown are as it will be after decorated

pyprobables API

Data Structures and Classes

Bloom Filters

BloomFilter

BloomFilterOnDisk

ExpandingBloomFilter

RotatingBloomFilter

CountingBloomFilter

Cuckoo Filters

CuckooFilter

CountingCuckooFilter

Count-Min Sketches

CountMinSketch

CountMeanSketch

CountMeanMinSketch

HeavyHitters

StreamThreshold

QuotientFilter

QuotientFilter

Utilities

Bitarray

Exceptions

Hashing Functions

Indices and Tables