pyprobables API

Here you can find the full developer API for the pyprobables project. pyprobables provides a suite of probabilistic data-structures to be used in data analytics and data science projects.

Data Structures and Classes

Bloom Filters

Bloom Filters are a class of probabilistic data structures used for set operations. Bloom Filters guarantee a zero percent false negative rate and a predetermined false positive rate. Once the number of elements inserted exceeds the estimated elements, the false positive rate will increase over the desired amount.

Further Reading

BloomFilter

class probables.BloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]

Simple Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)

Parameters:
  • est_elements (int) – The number of estimated elements to be added

  • false_positive_rate (float) – The desired false positive rate

  • filepath (str) – Path to file to load

  • hex_string (str) – Hex based representation to be loaded

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

BloomFilter

Note

Initialization order of operations:
  1. From file

  2. From Hex String

  3. From params

add(key)[source]

Add the key to the Bloom Filter

Parameters:

key (str) – The element to be inserted

Return type:

None

add_alt(hashes)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to insert

Return type:

None

property bloom: array

The bit/int array

Type:

list(int)

property bloom_length: int

Length of the Bloom Filter array

Note

Not settable

Type:

int

check(key)[source]

Check if the key is likely in the Bloom Filter

Parameters:

key (str) – The element to be checked

Returns:

True if likely encountered, False if definately not

Return type:

bool

check_alt(hashes)[source]

Check if the element represented by hashes is in the Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to check

Returns:

True if likely encountered, False if definately not

Return type:

bool

clear()[source]

Clear or reset the Counting Bloom Filter

Return type:

None

current_false_positive_rate()[source]

Calculate the current false positive rate based on elements added

Returns:

The current false positive rate

Return type:

float

property elements_added: int

Number of elements added to the Bloom Filter

Note

Changing this can cause the current false positive rate to be reported incorrectly

Type:

int

estimate_elements()[source]

Estimate the number of unique elements added

Returns:

Number of elements estimated to be inserted

Return type:

int

Note

Returns -1 if all bits in the Bloom filter are set

property estimated_elements: int

The maximum number of elements estimated to be added at setup

Note

Not settable

Type:

int

export(file)[source]

Export the Bloom Filter to disk

Parameters:

file (str) – The file or filepath to which the Bloom Filter will be written.

Return type:

None

export_c_header(filename)[source]

Export the Bloom Filter to disk as a C header file.

Parameters:

filename (str) – The filename to which the Bloom Filter will be written.

Return type:

None

export_hex()[source]

Export the Bloom Filter as a hex string

Returns:

Hex representation of the Bloom Filter

Return type:

str

export_size()[source]

Calculate the size of the bloom on disk

Returns:

Size of the Bloom Filter when exported to disk

Return type:

int

property false_positive_rate: float

The maximum desired false positive rate

Note

Not settable

Type:

float

classmethod frombytes(b, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Bloom Filter

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

BloomFilter

property hash_function: Callable[[str | bytes, int], List[int]]

The hash function used

Note

Not settable

Type:

function

hashes(key, depth=None)[source]

Return the hashes based on the provided key

Parameters:
  • key (str) – Description of arg1

  • depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes

Returns:

A list of the hashes for the key in int form

Return type:

List(int)

intersection(second)[source]

Return a new Bloom Filter that contains the intersection of the two

Parameters:

second (BloomFilter) – The Bloom Filter with which to take the intersection

Returns:

The new Bloom Filter containing the intersection

Return type:

BloomFilter

Raises:

TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilterOnDisk object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

property is_on_disk: bool

Is the Bloom Filter on Disk or not

Note

Not settable

Type:

bool

jaccard_index(second)[source]

Calculate the jaccard similarity score between two Bloom Filters

Parameters:

second (BloomFilter) – The Bloom Filter to compare with

Returns:

A numeric value between 0 and 1 where 1 is identical and 0 means completely different

Return type:

float

Raises:

TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilterOnDisk object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

property number_bits: int

Number of bits in the Bloom Filter

Note

Not settable

Type:

int

property number_hashes: int

The number of hashes required for the Bloom Filter hashing strategy

Note

Not settable

Type:

int

union(second)[source]

Return a new Bloom Filter that contains the union of the two

Parameters:

second (BloomFilter) – The Bloom Filter with which to calculate the union

Returns:

The new Bloom Filter containing the union

Return type:

BloomFilter

Raises:

TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilterOnDisk object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

BloomFilterOnDisk

class probables.BloomFilterOnDisk(filepath, est_elements=None, false_positive_rate=None, hex_string=None, hash_function=None)[source]

Simple Bloom Filter implementation directly on disk for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)

Parameters:
  • filepath (str) – Path to file to load

  • est_elements (int) – The number of estimated elements to be added

  • false_positive_rate (float) – The desired false positive rate

  • hex_string (str) – Hex based representation to be loaded

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

BloomFilterOnDisk

Raises:

NotSupportedError – Loading using a hex string is not supported

Note

Initialization order of operations:
  1. Esimated elements and false positive rate

  2. From Hex String

  3. Only filepath provided

add_alt(hashes)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to insert

Return type:

None

close()[source]

Clean up the BloomFilterOnDisk object

Return type:

None

export(file)[source]

Export to disk if a different location

Parameters:

file (str|Path) – The filename to which the Bloom Filter will be exported

Return type:

None

Note

Only exported if the filename is not the original filename

classmethod frombytes(b, hash_function=None)[source]

Raises: NotSupportedError

Parameters:
  • b (ByteString) –

  • hash_function (Callable[[str | bytes, int], List[int]] | None) –

Return type:

BloomFilterOnDisk

For more information of all methods and properties, see BloomFilter.

ExpandingBloomFilter

class probables.ExpandingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hash_function=None)[source]

Simple expanding Bloom Filter implementation for use in python; the Bloom Fiter will automatically expand, or grow, if the false positive rate is about to become greater than the desired false positive rate.

Parameters:
  • est_elements (int) – The number of estimated elements to be added

  • false_positive_rate (float) – The desired false positive rate

  • filepath (str) – Path to file to load

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

An expanding Bloom Filter object

Return type:

ExpandingBloomFilter

Note

Initialization order of operations:
  1. Filepath

  2. est_elements and false_positive_rate

add(key, force=False)[source]

Add the key to the Bloom Filter

Parameters:
  • key (str) – The element to be inserted

  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

add_alt(hashes, force=False)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to insert

  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

check(key)[source]

Check to see if the key is in the Bloom Filter

Parameters:

key (str) – The key to check for in the Bloom Filter

Returns:

True if the element is likely present; False if definately not present

Return type:

bool

check_alt(hashes)[source]

Check to see if the hashes are in the Bloom Filter

Parameters:

hashes (list) – The hash representation to check for in the Bloom Filter

Returns:

True if the element is likely present; False if definately not present

Return type:

bool

property elements_added: int

The total number of elements added

Type:

int

property estimated_elements: int

The original number of elements estimated to be in the Bloom Filter

Type:

int

property expansions: int

The number of expansions

Type:

int

export(file)[source]

Export an expanding Bloom Filter, or subclass, to disk

Parameters:
  • filepath (str) – The path to the file to import

  • file (Path | str | IOBase | mmap) –

Return type:

None

property false_positive_rate: float

The desired false positive rate of the expanding Bloom Filter

Type:

float

classmethod frombytes(b, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Expanding Bloom Filter

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

ExpandingBloomFilter

property hash_function: Callable[[str | bytes, int], List[int]]

The total number of elements added

Type:

int

push()[source]

Push a new expansion onto the Bloom Filter

Return type:

None

RotatingBloomFilter

class probables.RotatingBloomFilter(est_elements=None, false_positive_rate=None, max_queue_size=10, filepath=None, hash_function=None)[source]

Simple Rotating Bloom Filter implementation that allows for the “older” elements added to be removed, in chunks. As the queue fills up, those elements inserted earlier will be bulk removed. This also provides the user with the oportunity to force the removal instead of it being time based.

Parameters:
  • est_elements (int) – The number of estimated elements to be added

  • false_positive_rate (float) – The desired false positive rate

  • max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements

  • filepath (str) – Path to file to load

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Note

Initialization order of operations:
  1. Filepath

  2. est_elements and false_positive_rate

add(key, force=False)

Add the key to the Bloom Filter

Parameters:
  • key (str) – The element to be inserted

  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

add_alt(hashes, force=False)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to insert

  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

Return type:

None

check(key)

Check to see if the key is in the Bloom Filter

Parameters:

key (str) – The key to check for in the Bloom Filter

Returns:

True if the element is likely present; False if definately not present

Return type:

bool

check_alt(hashes)

Check to see if the hashes are in the Bloom Filter

Parameters:

hashes (list) – The hash representation to check for in the Bloom Filter

Returns:

True if the element is likely present; False if definately not present

Return type:

bool

property current_queue_size: int

The current size of the queue

Type:

int

property elements_added: int

The total number of elements added

Type:

int

property estimated_elements: int

The original number of elements estimated to be in the Bloom Filter

Type:

int

property expansions: int

The number of expansions

Type:

int

export(file)

Export an expanding Bloom Filter, or subclass, to disk

Parameters:
  • filepath (str) – The path to the file to import

  • file (Path | str | IOBase | mmap) –

Return type:

None

property false_positive_rate: float

The desired false positive rate of the expanding Bloom Filter

Type:

float

classmethod frombytes(b, max_queue_size, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Expanding Bloom Filter

  • max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

RotatingBloomFilter

property hash_function: Callable[[str | bytes, int], List[int]]

The total number of elements added

Type:

int

property max_queue_size: int

The maximum size for the queue

Type:

int

pop()[source]

Pop the oldest Bloom Filter off of the queue without pushing a new Bloom Filter onto the queue

Raises:

RotatingBloomFilterError – Unable to rotate the Bloom Filter

Return type:

None

push()[source]

Push a new bloom filter onto the queue and rotate if necessary

Return type:

None

CountingBloomFilter

class probables.CountingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]

Simple Counting Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/counting_bloom)

Parameters:
  • est_elements (int) – The number of estimated elements to be added

  • false_positive_rate (float) – The desired false positive rate

  • filepath (str) – Path to file to load

  • hex_string (str) – Hex based representation to be loaded

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Counting Bloom Filter object

Return type:

CountingBloomFilter

Note

Initialization order of operations:
  1. From file

  2. From Hex String

  3. From params

add(key, num_els=1)[source]

Add the key to the Counting Bloom Filter

Parameters:
  • key (str) – The element to be inserted

  • num_els (int) – Number of times to insert the element

Returns:

Maximum number of insertions

Return type:

int

add_alt(hashes, num_els=1)[source]

Add the element represented by hashes into the Counting Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to insert

  • num_els (int) – Number of times to insert the element

Returns:

Maximum number of insertions

Return type:

int

property bloom: array

The bit/int array

Type:

list(int)

property bloom_length: int

Length of the Bloom Filter array

Note

Not settable

Type:

int

check(key)[source]

Check if the key is likely in the Counting Bloom Filter

Parameters:

key (str) – The element to be checked

Returns:

Maximum number of insertions

Return type:

int

check_alt(hashes)[source]

Check if the element represented by hashes is in the Counting Bloom Filter

Parameters:

hashes (list) – A list of integers representing the key to check

Returns:

Maximum number of insertions

Return type:

int

clear()

Clear or reset the Counting Bloom Filter

Return type:

None

current_false_positive_rate()

Calculate the current false positive rate based on elements added

Returns:

The current false positive rate

Return type:

float

property elements_added: int

Number of elements added to the Bloom Filter

Note

Changing this can cause the current false positive rate to be reported incorrectly

Type:

int

estimate_elements()

Estimate the number of unique elements added

Returns:

Number of elements estimated to be inserted

Return type:

int

Note

Returns -1 if all bits in the Bloom filter are set

property estimated_elements: int

The maximum number of elements estimated to be added at setup

Note

Not settable

Type:

int

export(file)

Export the Bloom Filter to disk

Parameters:

file (str) – The file or filepath to which the Bloom Filter will be written.

Return type:

None

export_c_header(filename)

Export the Bloom Filter to disk as a C header file.

Parameters:

filename (str) – The filename to which the Bloom Filter will be written.

Return type:

None

export_hex()

Export the Bloom Filter as a hex string

Returns:

Hex representation of the Bloom Filter

Return type:

str

export_size()

Calculate the size of the bloom on disk

Returns:

Size of the Bloom Filter when exported to disk

Return type:

int

property false_positive_rate: float

The maximum desired false positive rate

Note

Not settable

Type:

float

classmethod frombytes(b, hash_function=None)[source]
Parameters:
  • b (ByteString) – the bytes to load as a Counting Bloom Filter

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Counting Bloom Filter object

Return type:

CountingBloomFilter

property hash_function: Callable[[str | bytes, int], List[int]]

The hash function used

Note

Not settable

Type:

function

hashes(key, depth=None)

Return the hashes based on the provided key

Parameters:
  • key (str) – Description of arg1

  • depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes

Returns:

A list of the hashes for the key in int form

Return type:

List(int)

intersection(second)[source]

Take the intersection of two Counting Bloom Filters

Parameters:

second (CountingBloomFilter) – The Bloom Filter with which to take the intersection

Returns:

The new Counting Bloom Filter containing the union

Return type:

CountingBloomFilter

Raises:

TypeError – When second is not a CountingBloomFilter

Note

The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

property is_on_disk: bool

Is the Bloom Filter on Disk or not

Note

Not settable

Type:

bool

jaccard_index(second)[source]

Take the Jaccard Index of two Counting Bloom Filters

Parameters:

second (CountingBloomFilter) – The Bloom Filter with which to take the jaccard index

Returns:

A numeric value between 0 and 1 where 1 is identical and 0 means completely different

Return type:

float

Raises:

TypeError – When second is not a CountingBloomFilter

Note

The Jaccard Index is based on the unique set of elements added and not the number of each element added

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

property number_bits: int

Number of bits in the Bloom Filter

Note

Not settable

Type:

int

property number_hashes: int

The number of hashes required for the Bloom Filter hashing strategy

Note

Not settable

Type:

int

remove(key, num_els=1)[source]

Remove the element from the counting bloom

Parameters:
  • key (str) – The element to be removed

  • num_els (int) – Number of times to remove the element

Returns:

Maximum number of insertions after the removal

Return type:

int

remove_alt(hashes, num_els=1)[source]

Remvoe the element represented by hashes from the Counting Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to remove

  • num_els (int) – Number of times to remove the element

Returns:

Maximum number of insertions after the removal

Return type:

int

union(second)[source]

Return a new Countiong Bloom Filter that contains the union of the two

Parameters:

second (CountingBloomFilter) – The Counting Bloom Filter with which to calculate the union

Returns:

The new Counting Bloom Filter containing the union

Return type:

CountingBloomFilter

Raises:

TypeError – When second is not a CountingBloomFilter

Note

The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

Cuckoo Filters

Cuckoo filters are a space efficient data structure that supports set membership testing. Cuckoo filters support insertion, deletion, and lookup of elements with low overhead and few false positive results. The name is derived from the cuckoo hashing strategy used to resolve conflicts.

Further Reading

CuckooFilter

class probables.CuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]

Simple Cuckoo Filter implementation

Parameters:
  • capacity (int) – The number of bins

  • bucket_size (int) – The number of buckets per bin

  • max_swaps (int) – The number of cuckoo swaps before stopping

  • expansion_rate (int) – The rate at which to expand

  • auto_expand (bool) – If the filter should automatically expand

  • finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import

  • filepath (str) – The path to the file to load or None if no file

  • hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

add(key)[source]

Add element key to the filter

Parameters:

key (str) – The element to add

Raises:

CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’

property auto_expand: bool

True if the cuckoo filter will expand automatically

Type:

bool

property bucket_size: int

The number of buckets per bin

Note

Not settable

Type:

int

property buckets: List[List[int]]

The buckets holding the fingerprints

Note

Not settable

Type:

list(list)

property capacity: int

The number of bins

Note

Not settable

Type:

int

check(key)[source]

Check if an element is in the filter

Parameters:

key (str) – Element to check

Returns:

True if likely present, False if definately not

Return type:

bool

property elements_added: int

The number of elements added

Note

Not settable

Type:

int

property error_rate: float

The error rate of the cuckoo filter

Type:

float

expand()[source]

Expand the cuckoo filter

property expansion_rate: int

The rate at expansion when the filter grows

Type:

int

export(file)[source]

Export cuckoo filter to file

Parameters:

file (Path | str | IOBase | mmap) – Path to file to export

Return type:

None

property fingerprint_size: int

The size in bytes of the fingerprint

Raises:

ValueError – If the size is not between 1 and 4

Note

The size of the fingerprint must be between 1 and 4

Type:

int

property fingerprint_size_bits: int

The size in bits of the fingerprint

Type:

int

classmethod frombytes(b, error_rate=None, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Expanding Bloom Filter

  • error_rate (float) – The error rate of the cuckoo filter, if used to generate the original filter

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

CuckooFilter

classmethod init_error_rate(error_rate, capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, hash_function=None)[source]

Initialize a simple Cuckoo Filter based on error rate

Parameters:
  • error_rate (float) –

  • capacity (int) – The number of bins

  • bucket_size (int) – The number of buckets per bin

  • max_swaps (int) – The number of cuckoo swaps before stopping

  • expansion_rate (int) – The rate at which to expand

  • auto_expand (bool) – If the filter should automatically expand

  • hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

classmethod load_error_rate(error_rate, filepath, hash_function=None)[source]

Initialize a previously exported Cuckoo Filter based on error rate

Parameters:
  • error_rate (float) –

  • filepath (str) – The path to the file to load or None if no file

  • hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

load_factor()[source]

float: How full the Cuckoo Filter is currently

Return type:

float

property max_swaps: int

The maximum number of swaps to perform

Note

Not settable

Type:

int

remove(key)[source]

Remove an element from the filter

Parameters:

key (str) – Element to remove

Returns:

True if removed, False if not present

Return type:

bool

CountingCuckooFilter

class probables.CountingCuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]

Simple Counting Cuckoo Filter implementation

Parameters:
  • capacity (int) – The number of bins

  • bucket_size (int) – The number of buckets per bin

  • max_swaps (int) – The number of cuckoo swaps before stopping

  • expansion_rate (int) – The rate at which to expand

  • auto_expand (bool) – If the filter should automatically expand

  • finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import

  • filename (str) – The path to the file to load or None if no file

  • filepath (str | Path | None) –

  • hash_function (Callable[[str | bytes, int], int] | None) –

Returns:

A Cuckoo Filter object

Return type:

CountingCuckooFilter

add(key)[source]

Add element key to the filter

Parameters:

key (str) – The element to add

Raises:

CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’

Return type:

None

property auto_expand: bool

True if the cuckoo filter will expand automatically

Type:

bool

property bucket_size: int

The number of buckets per bin

Note

Not settable

Type:

int

property buckets: List[List[CountingCuckooBin]]

The buckets holding the fingerprints

Note

Not settable

Type:

list(list)

property capacity: int

The number of bins

Note

Not settable

Type:

int

check(key)[source]

Check if an element is in the filter

Parameters:

key (str) – Element to check

Returns:

The number of times inserted into the filter

Return type:

int

property elements_added: int

The number of elements added

Note

Not settable

Type:

int

property error_rate: float

The error rate of the cuckoo filter

Type:

float

expand()[source]

Expand the cuckoo filter

property expansion_rate: int

The rate at expansion when the filter grows

Type:

int

export(file)[source]

Export cuckoo filter to file

Parameters:

file (str) – Path to file to export

Return type:

None

property fingerprint_size: int

The size in bytes of the fingerprint

Raises:

ValueError – If the size is not between 1 and 4

Note

The size of the fingerprint must be between 1 and 4

Type:

int

property fingerprint_size_bits: int

The size in bits of the fingerprint

Type:

int

classmethod frombytes(b, error_rate=None, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Expanding Bloom Filter

  • error_rate (float) – The error rate of the cuckoo filter, if used to generate the original filter

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

CountingCuckooFilter

classmethod init_error_rate(error_rate, capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, hash_function=None)[source]

Initialize a simple Cuckoo Filter based on error rate

Parameters:
  • error_rate (float) –

  • capacity (int) – The number of bins

  • bucket_size (int) – The number of buckets per bin

  • max_swaps (int) – The number of cuckoo swaps before stopping

  • expansion_rate (int) – The rate at which to expand

  • auto_expand (bool) – If the filter should automatically expand

  • hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

classmethod load_error_rate(error_rate, filepath, hash_function=None)[source]

Initialize a previously exported Cuckoo Filter based on error rate

Parameters:
  • error_rate (float) –

  • filepath (str) – The path to the file to load or None if no file

  • hash_function (function) – Hashing strategy function to use hf(key)

Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

load_factor()[source]

float: How full the Cuckoo Filter is currently

Return type:

float

property max_swaps: int

The maximum number of swaps to perform

Note

Not settable

Type:

int

remove(key)[source]

Remove an element from the filter

Parameters:

key (str) – Element to remove

Return type:

bool

property unique_elements: int

unique number of elements inserted

Type:

int

Count-Min Sketches

Count-Min Sketches, and its derivatives, are good for estimating the number of occurrences of an element in streaming data while not needing to retain all the data elements. The result is a probabilistic count of elements inserted into the data structure. It will always provide the maximum number of times a data element was encountered. Notice that the result may be more than the true number of times it was inserted, but never fewer.

Further Reading

CountMinSketch

class probables.CountMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:
  • width (int) – The width of the count-min sketch

  • depth (int) – The depth of the count-min sketch

  • confidence (float) – The level of confidence desired

  • error_rate (float) – The desired error rate

  • filepath (str) – Path to file to load

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Min Sketch object

Return type:

CountMinSketch

Note

Initialization order of operations:
  1. From file

  2. Width and depth

  3. Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Insert the element key into the count-min sketch

Parameters:
  • key (str) – The element to insert

  • num_els (int) – The number of times to insert the element

Returns:

The number of times the element was likely inserted after the insertion

Return type:

int

add_alt(hashes, num_els=1)[source]

Insert an element by using the hash representation

Parameters:
  • key (str) – The element to insert

  • num_els (int) – The number of times to insert the element

  • hashes (List[int]) –

Returns:

The number of times the element was likely inserted after the insertion

Return type:

int

check(key)[source]

Check number of times element ‘key’ is in the count-min sketch

Parameters:

key (str) – The key to check the number of times inserted

Returns:

The number of times the element was likely inserted

Return type:

int

check_alt(hashes)[source]

Check the count-min sketch for an element by using the hash representation

Parameters:

hashes (list) – The hashes representing the element to check

Returns:

The number of times the element was likely inserted

Return type:

int

clear()[source]

Reset the count-min sketch to an empty state

Return type:

None

property confidence: float

The confidence of the count-min sketch

Note

Not settable

Type:

float

property depth: int

The depth of the count-min sketch

Note

Not settable

Type:

int

property elements_added: int

The number of elements added to the count-min sketch

Note

Not settable

Type:

int

property error_rate: float

The error rate of the count-min sketch

Note

Not settable

Type:

float

export(file)[source]

Export the count-min sketch to disk

Parameters:
  • filename (str) – The filename to which the count-min sketch will be written.

  • file (Path | str | IOBase | mmap) –

Return type:

None

classmethod frombytes(b, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Count-Min Sketch

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A count-min sketch object

Return type:

CountMinSketch

hashes(key, depth=None)[source]

Return the hashes based on the provided key

Parameters:
  • key (str) – Description of arg1

  • depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes

Returns:

A list of the hashes for the key in int form

Return type:

List(int)

join(second)[source]

Join two count-min sketchs into a single count-min sketch; the calling count-min sketch will have the resulting combined data

Parameters:

second (CountMinSketch) – The count-min sketch to merge

Raises:
Return type:

None

Note

The calling count-min sketch will contain the combined data once complete

property query_type: str

The name of the query type being used

Note

Valid values:
  • ‘min’ or None

  • ‘mean’

  • ‘mean-min’

Type:

str

remove(key, num_els=1)[source]

Remove element ‘key’ from the count-min sketch

Parameters:
  • key (str) – The element to remove

  • num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

remove_alt(hashes, num_els=1)[source]

Remove an element by using the hash representation

Parameters:
  • hashes (list) – The hashes representing the element to remove

  • num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

property width: int

The width of the count-min sketch

Note

Not settable

Type:

int

CountMeanSketch

class probables.CountMeanSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Mean Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:
  • width (int) – The width of the count-min sketch

  • depth (int) – The depth of the count-min sketch

  • confidence (float) – The level of confidence desired

  • error_rate (float) – The desired error rate

  • filepath (str) – Path to file to load

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Mean Sketch object

Return type:

CountMeanSketch

Note

Initialization order of operations:
  1. From file

  2. Width and depth

  3. Confidence and error rate

Note

Default query type is mean

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

For more information of all methods and properties, see CountMinSketch.

CountMeanMinSketch

class probables.CountMeanMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Mean-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:
  • width (int) – The width of the count-min sketch

  • depth (int) – The depth of the count-min sketch

  • confidence (float) – The level of confidence desired

  • error_rate (float) – The desired error rate

  • filepath (str) – Path to file to load

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Mean-Min Sketch object

Return type:

CountMeanMinSketch

Note

Initialization order of operations:
  1. From file

  2. Width and depth

  3. Confidence and error rate

Note

Default query type is mean-min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

For more information of all methods and properties, see CountMinSketch.

HeavyHitters

class probables.HeavyHitters(num_hitters=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Find and track those elements that are the most common, or heavy hitters

Parameters:
  • num_hitters (int) – The maximum number of distinct elements to track

  • width (int) – The width of the count-min sketch

  • depth (int) – The depth of the count-min sketch

  • confidence (float) – The level of confidence desired

  • error_rate (float) – The desired error rate

  • filepath (str) – Path to file to load

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Min Sketch object

Return type:

HeavyHitters

Note

Initialization order of operations:
  1. From file

  2. Width and depth

  3. Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Add element to heavy hitters

Parameters:
  • key (str) – The element to add

  • num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Override function

add_alt(key, hashes, num_els=1)[source]

Add the element key represented as hashes to the HeavyHitters object (hence the different signature on the function!)

Parameters:
  • key (str) – The element to add

  • hashes (list) – The list of integers representing the key to insert

  • num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

clear()[source]

Clear out the heavy hitters!

Return type:

None

classmethod frombytes(b, num_hitters=100, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Expanding Bloom Filter

  • num_hitters (int) – The maximum number of distinct elements to track

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

HeavyHitters

property heavy_hitters: Dict[str, int]

Return the heavy hitters, or most common elements

Note

Not settable

Type:

dict

join(second)[source]

Join is not supported by HeavyHitters

Raises:

NotSupportedError – This functionality is currently not supported

Parameters:

second (HeavyHitters) –

property number_heavy_hitters: int

Return the maximum number of heavy hitters being tracked

Note

Not settable

Type:

int

remove_alt(hashes, num_els=1)[source]

Remove element based on hashes provided; not supported in heavy hitters

Raises:

NotSupportedError – This function is not supported by the HeavyHitters class

Parameters:
  • hashes (List[int]) –

  • num_els (int) –

Note

Override function

For more information of all methods and properties, see CountMinSketch.

StreamThreshold

class probables.StreamThreshold(threshold=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Find and track those elements that are above a certain threshold

Parameters:
  • threshold (int) – The threshold above which an element will be tracked

  • width (int) – The width of the count-min sketch

  • depth (int) – The depth of the count-min sketch

  • confidence (float) – The level of confidence desired

  • error_rate (float) – The desired error rate

  • filepath (str) – Path to file to load

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Count-Min Sketch object

Return type:

StreamThreshold

Note

Initialization order of operations:
  1. From file

  2. Width and depth

  3. Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Add the element for key into the data structure

Parameters:
  • key (str) – The element to add

  • num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Override function

add_alt(key, hashes, num_els=1)[source]

Add the element for key into the data structure

Parameters:
  • key (str) – The element to add

  • hashes (list) – The list of integers representing the key to insert

  • num_els (int) – The number of instances to add

Returns:

Number of times key has been inserted

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

clear()[source]

Clear out the stream threshold!

Return type:

None

classmethod frombytes(b, threshold=100, hash_function=None)[source]
Parameters:
  • b (ByteString) – The bytes to load as a Expanding Bloom Filter

  • threshold (int) – The threshold above which an element will be tracked

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

A Bloom Filter object

Return type:

StreamThreshold

join(second)[source]

Join is not supported by StreamThreshold

Raises:

NotSupportedError – This functionality is currently not supported

Parameters:

second (StreamThreshold) –

property meets_threshold: Dict[str, int]

Those keys that meet the required threshold (with value)

Type:

dict

remove(key, num_els=1)[source]

Remove element ‘key’ from the count-min sketch

Parameters:
  • key (str) – The element to remove

  • num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

Note

Override function

remove_alt(key, hashes, num_els=1)[source]

Remove an element by using the hash representation

Parameters:
  • key (str) – The key that the hashes represent

  • hashes (list) – The hashes representing the element to remove

  • num_els (int) – The number of times to remove the element

Returns:

The number of times the element was likely inserted after the removal

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

property threshold: int

The threshold at which a key is tracked

Type:

int

For more information of all methods and properties, see CountMinSketch.

QuotientFilter

Quotient filters are an aproximate membership query filter (AMQ) that is both space efficient and returns a zero false negative rate and a probablistic false positive rate. Unlike Bloom filters, the quotient filter only requires a single hash of the element to insert. The upper q bits denote the location within the filter while the lower r bits are stored in the filter.

Quotient filters provide some useful benifits over Bloom filters including:

  • Merging of two filters (not union)

  • Resizing of the filter

  • Ability to remove elements

Further Reading

QuotientFilter

class probables.QuotientFilter(quotient=20, auto_expand=True, hash_function=None)[source]

Simple Quotient Filter implementation

Parameters:
  • quotient (int) – The size of the quotient to use

  • auto_expand (bool) – Automatically expand or not

  • hash_function (function) – Hashing strategy function to use hf(key, number)

Returns:

The initialized filter

Return type:

QuotientFilter

Raises:

QuotientFilterError – Raised when unable to initialize

Note

The size of the QuotientFilter will be 2**q

add(key)[source]

Add key to the quotient filter

Parameters:

key (str|bytes) – The element to add

Raises:

QuotientFilterError – Raised when no locations are available in which to insert

Return type:

None

add_alt(_hash)[source]

Add the pre-hashed value to the quotient filter

Parameters:

_hash (int) – The element to add

Raises:

QuotientFilterError – Raised when no locations are available in which to insert

Return type:

None

property auto_expand: bool

Will the quotient filter automatically expand

Type:

bool

property bits_per_elm: int

The number of bits used per element

Type:

int

check(key)[source]

Check to see if key is likely in the quotient filter

Parameters:

key (str|bytes) – The element to add

Returns:

True if likely encountered, False if definately not

Return type:

bool

check_alt(_hash)[source]

Check to see if the pre-calculated hash is likely in the quotient filter

Parameters:

_hash (int) – The element to add

Returns:

True if likely encountered, False if definately not

Return type:

bool

property elements_added: int

The number of elements added to the filter

Type:

int

get_hashes()[source]

Get the hashes from the quotient filter as a list

Returns:

The hash values stored in the quotient filter

Return type:

list(int)

hashes()[source]

A generator over the hashes in the quotient filter

Yields:

int – The next hash stored in the quotient filter

Return type:

Iterator[int]

property load_factor: float

The load factor of the filter

Type:

float

property max_load_factor: float

The maximum allowed load factor after which auto expanding should occur

Type:

float

merge(second)[source]

Merge the second quotient filter into the first

Parameters:

second (QuotientFilter) – The quotient filter to merge

Return type:

None

Note

The hashing function between the two filters should match

Note

Errors can occur if the quotient filter being inserted into does not expand (i.e., auto_expand=False)

property num_elements: int

The total size of the filter

Type:

int

print()[source]

show the bits and the run/cluster/continuation/empty status

property quotient: int

The size of the quotient, in bits

Type:

int

property remainder: int

The size of the remainder, in bits

Type:

int

resize(quotient=None)[source]

Resize the quotient filter to use the new quotient size

Parameters:

quotient (int) – The new quotient to use

Return type:

None

Note

If None is provided, the quotient filter will double in size (quotient + 1)

Raises:

QuotientFilterError – When the new quotient will not accommodate the elements already added

Parameters:

quotient (int | None) –

Return type:

None

property size: int

The number of bins available in the filter

Note

same as num_elements

Type:

int

validate_metadata()[source]

check for invalid bit settings

Utilities

Bitarray

class probables.utilities.Bitarray(size)[source]

Simplified, pure python bitarray implementation using as little memory as possible

Parameters:

size (int) – The number of bits in the bitarray

Returns:

A bitarray

Return type:

Bitarray

Raises:
  • TypeError

  • ValueError

as_string()[source]

String representation of the bitarray

Returns:

Bitarray representation as a string

Return type:

str

property bitarray: array

The bitarray

check_bit(idx)[source]

Check if the bit idx is set

Parameters:

idx (int) – The index to check

Returns:

The status of the bit, either 0 or 1

Return type:

int

clear()[source]

Clear all bits in the bitarray

clear_bit(idx)[source]

Set the bit at idx to 0

Parameters:

idx (int) – The index to clear

Return type:

None

is_bit_set(idx)[source]

Check if the bit idx is set

Parameters:

idx (int) – The index to check

Returns:

The status of the bit, either 0 or 1

Return type:

int

num_bits_set()[source]

Number of bits set in the bitarray

Returns:

Number of bits set

Return type:

int

set_bit(idx)[source]

Set the bit at idx to 1

Parameters:

idx (int) – The index to set

Return type:

None

property size: int

The number of bits in the bitarray

property size_bytes: int

The size of the bitarray in bytes

Exceptions

PyProbables Exceptions

exception probables.exceptions.CountMinSketchError(message)[source]

CountMinSketch Exception

Parameters:

message (str) – The error message to be reported

Return type:

None

exception probables.exceptions.CuckooFilterFullError(message)[source]

Cuckoo Filter Full Exception

Parameters:

message (str) – The error message to be reported

Return type:

None

exception probables.exceptions.InitializationError(message)[source]

Initialization Exception

Parameters:

message (str) – The initialization error messge

Return type:

None

exception probables.exceptions.NotSupportedError(message)[source]

Not Supported Functionality Exception

Parameters:

message (str) – The error message to be reported

Return type:

None

exception probables.exceptions.ProbablesBaseException(message)[source]

Base ProbablesBaseException

Parameters:

message (str) – The error message to be reported

Return type:

None

exception probables.exceptions.QuotientFilterError(message)[source]

Quotient Filter Exception

Parameters:

message (str) – The error message to be reported

Return type:

None

exception probables.exceptions.RotatingBloomFilterError(message)[source]

RotatingBloomFilter unable to rotate Blooms Exceptions

Parameters:

message (str) – The error message to be reported

Return type:

None

Hashing Functions

Probables Hashing Utilities

probables.hashes.default_fnv_1a(key, depth=1)[source]

The default fnv-1a hashing routine

Parameters:
  • key (str) – The element to be hashed

  • depth (int) – The number of hash permutations to compute

Returns:

List of size depth hashes

Return type:

list(int)

probables.hashes.default_md5(key, *args, **kwargs)[source]

The default md5 hashing routine

Parameters:
  • key (str) – The element to be hashed

  • depth (int) – The number of hash permutations to compute

Returns:

List of 64-bit hashed representation of key hashes

Return type:

list(int)

Note

Returns the upper-most 64 bits

probables.hashes.default_sha256(key, *args, **kwargs)[source]

The default sha256 hashing routine

Parameters:
  • key (str) – The element to be hashed

  • depth (int) – The number of hash permutations to compute

Returns:

List of 64-bit hashed representation of key hashes

Return type:

list(int)

Note

Returns the upper-most 64 bits

probables.hashes.fnv_1a(key, seed=0)[source]

Pure python implementation of the 64 bit fnv-1a hash

Parameters:
  • key (str) – The element to be hashed

  • seed (int) – Add a seed to the initial starting point (0 means no seed)

Returns:

64-bit hashed representation of key

Return type:

int

Note

Uses the lower 64 bits when overflows occur

probables.hashes.fnv_1a_32(key, seed=0)[source]

Pure python implementation of the 32 bit fnv-1a hash :param key: The element to be hashed :type key: str :param seed: Add a seed to the initial starting point (0 means no seed) :type seed: int

Returns:

32-bit hashed representation of key

Return type:

int

Parameters:
  • key (str | bytes) –

  • seed (int) –

Note

Uses the lower 32 bits when overflows occur

probables.hashes.hash_with_depth_bytes(func)[source]

Decorator to turns a function taking a single key and hashes it to bytes. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.

Parameters:
  • key (str) – The element to be hashed

  • depth (int) – The number of hash permutations to compute

  • func (Callable[[str | bytes, int], bytes]) –

Returns:

64-bit hashed representation of key

Return type:

list(int)

Note

Arguments shown are as it will be after decorated

probables.hashes.hash_with_depth_int(func)[source]

Decorator to turn a function that takes a single key and hashes it to an int. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.

Parameters:
  • key (str) – The element to be hashed

  • depth (int) – The number of hash permutations to compute

  • func (Callable[[str | bytes, int], List[int]]) –

Returns:

64-bit hashed representation of key

Return type:

list(int)

Note

Arguments shown are as it will be after decorated

Indices and Tables