pyprobables API

Here you can find the full developer API for the pyprobables project. pyprobables provides a suite of probabilistic data-structures to be used in data analytics and data science projects.

Data Structures and Classes

Bloom Filters

Bloom Filters are a class of probabilistic data structures used for set operations. Bloom Filters guarantee a zero percent false negative rate and a predetermined false positive rate. Once the number of elements inserted exceeds the estimated elements, the false positive rate will increase over the desired amount. Further Reading

BloomFilter

class probables.BloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]

Simple Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)

Parameters:
  • est_elements (int) – The number of estimated elements to be added
  • false_positive_rate (float) – The desired false positive rate
  • filepath (str) – Path to file to load
  • hex_string (str) – Hex based representation to be loaded
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

A Bloom Filter object

Return type:

BloomFilter

Note

Initialization order of operations:
  1. From file
  2. From Hex String
  3. From params
add(key)

Add the key to the Bloom Filter

Parameters:key (str) – The element to be inserted
add_alt(hashes)

Add the element represented by hashes into the Bloom Filter

Parameters:hashes (list) – A list of integers representing the key to insert
bloom

The bit/int array

Type:list(int)
bloom_length

Length of the Bloom Filter array

Note

Not settable

Type:int
check(key)

Check if the key is likely in the Bloom Filter

Parameters:key (str) – The element to be checked
Returns:True if likely encountered, False if definately not
Return type:bool
check_alt(hashes)

Check if the element represented by hashes is in the Bloom Filter

Parameters:hashes (list) – A list of integers representing the key to check
Returns:True if likely encountered, False if definately not
Return type:bool
clear()

Clear or reset the Counting Bloom Filter

current_false_positive_rate()

Calculate the current false positive rate based on elements added

Returns:The current false positive rate
Return type:float
elements_added

Number of elements added to the Bloom Filter

Note

Changing this can cause the current false positive rate to be reported incorrectly

Type:int
estimate_elements()

Estimate the number of unique elements added

Returns:Number of elements estimated to be inserted
Return type:int
estimated_elements

The maximum number of elements estimated to be added at setup

Note

Not settable

Type:int
export(filename)

Export the Bloom Filter to disk

Parameters:filename (str) – The filename to which the Bloom Filter will be written.
export_hex()

Export the Bloom Filter as a hex string

Returns:Hex representation of the Bloom Filter
Return type:str
export_size()

Calculate the size of the bloom on disk

Returns:Size of the Bloom Filter when exported to disk
Return type:int
false_positive_rate

The maximum desired false positive rate

Note

Not settable

Type:float
hash_function

The hash function used

Note

Not settable

Type:function
hashes(key, depth=None)

Return the hashes based on the provided key

Parameters:
  • key (str) – Description of arg1
  • depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
Returns:

A list of the hashes for the key in int form

Return type:

List(int)

intersection(second)[source]

Return a new Bloom Filter that contains the intersection of the two

Parameters:second (BloomFilter) – The Bloom Filter with which to take the intersection
Returns:The new Bloom Filter containing the intersection
Return type:BloomFilter
Raises:TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilterOnDisk object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

is_on_disk

Is the Bloom Filter on Disk or not

Note

Not settable

Type:bool
jaccard_index(second)[source]

Calculate the jaccard similarity score between two Bloom Filters

Parameters:second (BloomFilter) – The Bloom Filter to compare with
Returns:A numeric value between 0 and 1 where 1 is identical and 0 means completely different
Return type:float
Raises:TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilterOnDisk object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

number_bits

Number of bits in the Bloom Filter

Note

Not settable

Type:int
number_hashes

The number of hashes required for the Bloom Filter hashing strategy

Note

Not settable

Type:int
union(second)[source]

Return a new Bloom Filter that contains the union of the two

Parameters:second (BloomFilter) – The Bloom Filter with which to calculate the union
Returns:The new Bloom Filter containing the union
Return type:BloomFilter
Raises:TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilterOnDisk object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

BloomFilterOnDisk

class probables.BloomFilterOnDisk(filepath, est_elements=None, false_positive_rate=None, hex_string=None, hash_function=None)[source]

Simple Bloom Filter implementation directly on disk for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)

Parameters:
  • filepath (str) – Path to file to load
  • est_elements (int) – The number of estimated elements to be added
  • false_positive_rate (float) – The desired false positive rate
  • hex_string (str) – Hex based representation to be loaded
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

A Bloom Filter object

Return type:

BloomFilterOnDisk

Raises:

NotSupportedError – Loading using a hex string is not supported

Note

Initialization order of operations:
  1. Esimated elements and false positive rate
  2. From Hex String
  3. Only filepath provided
add_alt(hashes)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:hashes (list) – A list of integers representing the key to insert
close()[source]

Clean up the BloomFilterOnDisk object

export(filename)[source]

Export to disk if a different location

Parameters:filename (str) – The filename to which the Bloom Filter will be exported

Note

Only exported if the filename is not the original filename

export_hex()[source]

Export to a hex string

Raises:NotSupportedError – This functionality is currently not supported
intersection(second)[source]

Return a new Bloom Filter that contains the intersection of the two

Parameters:second (BloomFilter) – The Bloom Filter with which to take the intersection
Returns:The new Bloom Filter containing the intersection
Return type:BloomFilter
Raises:TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilter object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

jaccard_index(second)[source]

Calculate the jaccard similarity score between two Bloom Filters

Parameters:second (BloomFilter) – The Bloom Filter to compare with
Returns:A numeric value between 0 and 1 where 1 is identical and 0 means completely different
Return type:float
Raises:TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilter object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

union(second)[source]

Return a new Bloom Filter that contains the union of the two

Parameters:second (BloomFilter) – The Bloom Filter with which to calculate the union
Returns:The new Bloom Filter containing the union
Return type:BloomFilter
Raises:TypeError – When second is not either a BloomFilter or BloomFilterOnDisk

Note

second may be a BloomFilter object

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

For more information of all methods and properties, see BloomFilter.

ExpandingBloomFilter

class probables.ExpandingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hash_function=None)[source]

Simple expanding Bloom Filter implementation for use in python; the Bloom Fiter will automatically expand, or grow, if the false positive rate is about to become greater than the desired false positive rate.

Parameters:
  • est_elements (int) – The number of estimated elements to be added
  • false_positive_rate (float) – The desired false positive rate
  • filepath (str) – Path to file to load
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

An expanding Bloom Filter object

Return type:

ExpandingBloomFilter

Note

Initialization order of operations:
  1. Filepath
  2. est_elements and false_positive_rate
add(key, force=False)[source]

Add the key to the Bloom Filter

Parameters:
  • key (str) – The element to be inserted
  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
add_alt(hashes, force=False)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to insert
  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
check(key)[source]

Check to see if the key is in the Bloom Filter

Parameters:key (str) – The key to check for in the Bloom Filter
Returns:True if the element is likely present; False if definately not present
Return type:bool
check_alt(hashes)[source]

Check to see if the hashes are in the Bloom Filter

Parameters:hashes (list) – The hash representation to check for in the Bloom Filter
Returns:True if the element is likely present; False if definately not present
Return type:bool
elements_added

The total number of elements added

Type:int
estimated_elements

The original number of elements estimated to be in the Bloom Filter

Type:int
expansions

The number of expansions

Type:int
export(filepath)[source]

Export an expanding Bloom Filter, or subclass, to disk

Parameters:filepath (str) – The path to the file to import
false_positive_rate

The desired false positive rate of the expanding Bloom Filter

Type:float
push()[source]

Push a new expansion onto the Bloom Filter

RotatingBloomFilter

class probables.RotatingBloomFilter(est_elements=None, false_positive_rate=None, max_queue_size=10, filepath=None, hash_function=None)[source]

Simple Rotating Bloom Filter implementation that allows for the “older” elements added to be removed, in chunks. As the queue fills up, those elements inserted earlier will be bulk removed. This also provides the user with the oportunity to force the removal instead of it being time based.

Parameters:
  • est_elements (int) – The number of estimated elements to be added
  • false_positive_rate (float) – The desired false positive rate
  • max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements
  • filepath (str) – Path to file to load
  • hash_function (function) – Hashing strategy function to use hf(key, number)

Note

Initialization order of operations:
  1. Filepath
  2. est_elements and false_positive_rate
add(key, force=False)

Add the key to the Bloom Filter

Parameters:
  • key (str) – The element to be inserted
  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
add_alt(hashes, force=False)[source]

Add the element represented by hashes into the Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to insert
  • force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter
check(key)

Check to see if the key is in the Bloom Filter

Parameters:key (str) – The key to check for in the Bloom Filter
Returns:True if the element is likely present; False if definately not present
Return type:bool
check_alt(hashes)

Check to see if the hashes are in the Bloom Filter

Parameters:hashes (list) – The hash representation to check for in the Bloom Filter
Returns:True if the element is likely present; False if definately not present
Return type:bool
current_queue_size

The current size of the queue

Type:int
elements_added

The total number of elements added

Type:int
estimated_elements

The original number of elements estimated to be in the Bloom Filter

Type:int
expansions

The number of expansions

Type:int
export(filepath)

Export an expanding Bloom Filter, or subclass, to disk

Parameters:filepath (str) – The path to the file to import
false_positive_rate

The desired false positive rate of the expanding Bloom Filter

Type:float
max_queue_size

The maximum size for the queue

Type:int
pop()[source]

Pop the oldest Bloom Filter off of the queue without pushing a new Bloom Filter onto the queue

Raises:RotatingBloomFilterError – Unable to rotate the Bloom Filter
push()[source]

Push a new bloom filter onto the queue and rotate if necessary

CountingBloomFilter

class probables.CountingBloomFilter(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]

Simple Counting Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/counting_bloom)

Parameters:
  • est_elements (int) – The number of estimated elements to be added
  • false_positive_rate (float) – The desired false positive rate
  • filepath (str) – Path to file to load
  • hex_string (str) – Hex based representation to be loaded
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

A Counting Bloom Filter object

Return type:

CountingBloomFilter

Note

Initialization order of operations:
  1. From file
  2. From Hex String
  3. From params
add(key, num_els=1)[source]

Add the key to the Counting Bloom Filter

Parameters:
  • key (str) – The element to be inserted
  • num_els (int) – Number of times to insert the element
Returns:

Maximum number of insertions

Return type:

int

add_alt(hashes, num_els=1)[source]

Add the element represented by hashes into the Counting Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to insert
  • num_els (int) – Number of times to insert the element
Returns:

Maximum number of insertions

Return type:

int

bloom

The bit/int array

Type:list(int)
bloom_length

Length of the Bloom Filter array

Note

Not settable

Type:int
check(key)[source]

Check if the key is likely in the Counting Bloom Filter

Parameters:key (str) – The element to be checked
Returns:Maximum number of insertions
Return type:int
check_alt(hashes)[source]

Check if the element represented by hashes is in the Counting Bloom Filter

Parameters:hashes (list) – A list of integers representing the key to check
Returns:Maximum number of insertions
Return type:int
clear()

Clear or reset the Counting Bloom Filter

current_false_positive_rate()

Calculate the current false positive rate based on elements added

Returns:The current false positive rate
Return type:float
elements_added

Number of elements added to the Bloom Filter

Note

Changing this can cause the current false positive rate to be reported incorrectly

Type:int
estimate_elements()

Estimate the number of unique elements added

Returns:Number of elements estimated to be inserted
Return type:int
estimated_elements

The maximum number of elements estimated to be added at setup

Note

Not settable

Type:int
export(filename)

Export the Bloom Filter to disk

Parameters:filename (str) – The filename to which the Bloom Filter will be written.
export_hex()

Export the Bloom Filter as a hex string

Returns:Hex representation of the Bloom Filter
Return type:str
export_size()

Calculate the size of the bloom on disk

Returns:Size of the Bloom Filter when exported to disk
Return type:int
false_positive_rate

The maximum desired false positive rate

Note

Not settable

Type:float
hash_function

The hash function used

Note

Not settable

Type:function
hashes(key, depth=None)

Return the hashes based on the provided key

Parameters:
  • key (str) – Description of arg1
  • depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
Returns:

A list of the hashes for the key in int form

Return type:

List(int)

intersection(second)[source]

Take the intersection of two Counting Bloom Filters

Parameters:second (CountingBloomFilter) – The Bloom Filter with which to take the intersection
Returns:The new Counting Bloom Filter containing the union
Return type:CountingBloomFilter
Raises:TypeError – When second is not a CountingBloomFilter

Note

The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

is_on_disk

Is the Bloom Filter on Disk or not

Note

Not settable

Type:bool
jaccard_index(second)[source]

Take the Jaccard Index of two Counting Bloom Filters

Parameters:second (CountingBloomFilter) – The Bloom Filter with which to take the jaccard index
Returns:A numeric value between 0 and 1 where 1 is identical and 0 means completely different
Return type:float
Raises:TypeError – When second is not a CountingBloomFilter

Note

The Jaccard Index is based on the unique set of elements added and not the number of each element added

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

number_bits

Number of bits in the Bloom Filter

Note

Not settable

Type:int
number_hashes

The number of hashes required for the Bloom Filter hashing strategy

Note

Not settable

Type:int
remove(key, num_els=1)[source]

Remove the element from the counting bloom

Parameters:
  • key (str) – The element to be removed
  • num_els (int) – Number of times to remove the element
Returns:

Maximum number of insertions after the removal

Return type:

int

remove_alt(hashes, num_els=1)[source]

Remvoe the element represented by hashes from the Counting Bloom Filter

Parameters:
  • hashes (list) – A list of integers representing the key to remove
  • num_els (int) – Number of times to remove the element
Returns:

Maximum number of insertions after the removal

Return type:

int

union(second)[source]

Return a new Countiong Bloom Filter that contains the union of the two

Parameters:second (CountingBloomFilter) – The Counting Bloom Filter with which to calculate the union
Returns:The new Counting Bloom Filter containing the union
Return type:CountingBloomFilter
Raises:TypeError – When second is not a CountingBloomFilter

Note

The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()

Note

If second is not of the same size (false_positive_rate and est_elements) then this will return None

Cuckoo Filters

Cuckoo filters are a space efficient data structure that supports set membership testing. Cuckoo filters support insertion, deletion, and lookup of elements with low overhead and few false positive results. The name is derived from the cuckoo hashing strategy used to resolve conflicts. Further Reading

CuckooFilter

class probables.CuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]

Simple Cuckoo Filter implementation

Parameters:
  • capacity (int) – The number of bins
  • bucket_size (int) – The number of buckets per bin
  • max_swaps (int) – The number of cuckoo swaps before stopping
  • expansion_rate (int) – The rate at which to expand
  • auto_expand (bool) – If the filter should automatically expand
  • finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
  • filename (str) – The path to the file to load or None if no file
  • hash_function (function) – Hashing strategy function to use hf(key)
Returns:

A Cuckoo Filter object

Return type:

CuckooFilter

add(key)[source]

Add element key to the filter

Parameters:key (str) – The element to add
Raises:CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’
auto_expand

True if the cuckoo filter will expand automatically

Type:bool
bucket_size

The number of buckets per bin

Note

Not settable

Type:int
buckets

The buckets holding the fingerprints

Note

Not settable

Type:list(list)
capacity

The number of bins

Note

Not settable

Type:int
check(key)[source]

Check if an element is in the filter

Parameters:key (str) – Element to check
Returns:True if likely present, False if definately not
Return type:bool
elements_added

The number of elements added

Note

Not settable

Type:int
expand()[source]

Expand the cuckoo filter

expansion_rate

The rate at expansion when the filter grows

Type:int
export(filename)[source]

Export cuckoo filter to file

Parameters:filename (str) – Path to file to export
fingerprint_size

The size in bytes of the fingerprint

Raises:ValueError – If the size is not between 1 and 4

Note

The size of the fingerprint must be between 1 and 4

Type:int
load_factor()[source]

float: How full the Cuckoo Filter is currently

max_swaps

The maximum number of swaps to perform

Note

Not settable

Type:int
remove(key)[source]

Remove an element from the filter

Parameters:key (str) – Element to remove
Returns:True if removed, False if not present
Return type:bool

CountingCuckooFilter

class probables.CountingCuckooFilter(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]

Simple Counting Cuckoo Filter implementation

Parameters:
  • capacity (int) – The number of bins
  • bucket_size (int) – The number of buckets per bin
  • max_swaps (int) – The number of cuckoo swaps before stopping
  • expansion_rate (int) – The rate at which to expand
  • auto_expand (bool) – If the filter should automatically expand
  • finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
  • filename (str) – The path to the file to load or None if no file
Returns:

A Cuckoo Filter object

Return type:

CountingCuckooFilter

add(key)[source]

Add element key to the filter

Parameters:key (str) – The element to add
Raises:CuckooFilterFullError – When element not inserted after maximum number of swaps or ‘kicks’
auto_expand

True if the cuckoo filter will expand automatically

Type:bool
bucket_size

The number of buckets per bin

Note

Not settable

Type:int
buckets

The buckets holding the fingerprints

Note

Not settable

Type:list(list)
capacity

The number of bins

Note

Not settable

Type:int
check(key)[source]

Check if an element is in the filter

Parameters:key (str) – Element to check
Returns:The number of times inserted into the filter
Return type:int
elements_added

The number of elements added

Note

Not settable

Type:int
expand()[source]

Expand the cuckoo filter

expansion_rate

The rate at expansion when the filter grows

Type:int
export(filename)[source]

Export cuckoo filter to file

Parameters:filename (str) – Path to file to export
fingerprint_size

The size in bytes of the fingerprint

Raises:ValueError – If the size is not between 1 and 4

Note

The size of the fingerprint must be between 1 and 4

Type:int
load_factor()[source]

float: How full the Cuckoo Filter is currently

max_swaps

The maximum number of swaps to perform

Note

Not settable

Type:int
remove(key)[source]

Remove an element from the filter

Parameters:key (str) – Element to remove
unique_elements

unique number of elements inserted

Type:int

Count-Min Sketches

Count-Min Sketches, and its derivatives, are good for estimating the number of occurrences of an element in streaming data while not needing to retain all the data elements. The result is a probabilistic count of elements inserted into the data structure. It will always provide the maximum number of times a data element was encountered. Notice that the result may be more than the true number of times it was inserted, but never fewer. Further Reading

CountMinSketch

class probables.CountMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:
  • width (int) – The width of the count-min sketch
  • depth (int) – The depth of the count-min sketch
  • confidence (float) – The level of confidence desired
  • error_rate (float) – The desired error rate
  • filepath (str) – Path to file to load
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

A Count-Min Sketch object

Return type:

CountMinSketch

Note

Initialization order of operations:
  1. From file
  2. Width and depth
  3. Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Insert the element key into the count-min sketch

Parameters:
  • key (str) – The element to insert
  • num_els (int) – The number of times to insert the element
Returns:

The number of times the element was likely inserted after the insertion

Return type:

int

add_alt(hashes, num_els=1)[source]

Insert an element by using the hash representation

Parameters:
  • key (str) – The element to insert
  • num_els (int) – The number of times to insert the element
Returns:

The number of times the element was likely inserted after the insertion

Return type:

int

check(key)[source]

Check number of times element ‘key’ is in the count-min sketch

Parameters:key (str) – The key to check the number of times inserted
Returns:The number of times the element was likely inserted
Return type:int
check_alt(hashes)[source]

Check the count-min sketch for an element by using the hash representation

Parameters:hashes (list) – The hashes representing the element to check
Returns:The number of times the element was likely inserted
Return type:int
clear()[source]

Reset the count-min sketch to an empty state

confidence

The confidence of the count-min sketch

Note

Not settable

Type:float
depth

The depth of the count-min sketch

Note

Not settable

Type:int
elements_added

The number of elements added to the count-min sketch

Note

Not settable

Type:int
error_rate

The error rate of the count-min sketch

Note

Not settable

Type:float
export(filepath)[source]

Export the count-min sketch to disk

Parameters:filename (str) – The filename to which the count-min sketch will be written.
hashes(key, depth=None)[source]

Return the hashes based on the provided key

Parameters:
  • key (str) – Description of arg1
  • depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
Returns:

A list of the hashes for the key in int form

Return type:

List(int)

query_type

The name of the query type being used

Note

Valid values:
  • ‘min’ or None
  • ‘mean’
  • ‘mean-min’
Type:str
remove(key, num_els=1)[source]

Remove element ‘key’ from the count-min sketch

Parameters:
  • key (str) – The element to remove
  • num_els (int) – The number of times to remove the element
Returns:

The number of times the element was likely inserted after the removal

Return type:

int

remove_alt(hashes, num_els=1)[source]

Remove an element by using the hash representation

Parameters:
  • hashes (list) – The hashes representing the element to remove
  • num_els (int) – The number of times to remove the element
Returns:

The number of times the element was likely inserted after the removal

Return type:

int

width

The width of the count-min sketch

Note

Not settable

Type:int

CountMeanSketch

class probables.CountMeanSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Mean Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:
  • width (int) – The width of the count-min sketch
  • depth (int) – The depth of the count-min sketch
  • confidence (float) – The level of confidence desired
  • error_rate (float) – The desired error rate
  • filepath (str) – Path to file to load
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

A Count-Mean Sketch object

Return type:

CountMeanSketch

Note

Initialization order of operations:
  1. From file
  2. Width and depth
  3. Confidence and error rate

Note

Default query type is mean

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

For more information of all methods and properties, see CountMinSketch.

CountMeanMinSketch

class probables.CountMeanMinSketch(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Simple Count-Mean-Min Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/count-min-sketch)

Parameters:
  • width (int) – The width of the count-min sketch
  • depth (int) – The depth of the count-min sketch
  • confidence (float) – The level of confidence desired
  • error_rate (float) – The desired error rate
  • filepath (str) – Path to file to load
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

A Count-Mean-Min Sketch object

Return type:

CountMeanMinSketch

Note

Initialization order of operations:
  1. From file
  2. Width and depth
  3. Confidence and error rate

Note

Default query type is mean-min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

For more information of all methods and properties, see CountMinSketch.

HeavyHitters

class probables.HeavyHitters(num_hitters=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

Find and track those elements that are the most common, or heavy hitters

Parameters:
  • num_hitters (int) – The maximum number of distinct elements to track
  • width (int) – The width of the count-min sketch
  • depth (int) – The depth of the count-min sketch
  • confidence (float) – The level of confidence desired
  • error_rate (float) – The desired error rate
  • filepath (str) – Path to file to load
  • hash_function (function) – Hashing strategy function to use hf(key, number)
Returns:

A Count-Min Sketch object

Return type:

HeavyHitters

Note

Initialization order of operations:
  1. From file
  2. Width and depth
  3. Confidence and error rate

Note

Default query type is min

Note

For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add(key, num_els=1)[source]

Add element to heavy hitters

Parameters:
  • key (str) – The element to add
  • num_els (int) – The number of instances to add
Returns:

Number of times key has been inserted

Return type:

int

Note

Override function

add_alt(key, hashes, num_els=1)[source]

Add the element key represented as hashes to the HeavyHitters object (hence the different signature on the function!)

Parameters:
  • key (str) – The element to add
  • hashes (list) – The list of integers representing the key to insert
  • num_els (int) – The number of instances to add
Returns:

Number of times key has been inserted

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

clear()[source]

Clear out the heavy hitters!

heavy_hitters

Return the heavy hitters, or most common elements

Note

Not settable

Type:dict
number_heavy_hitters

Return the maximum number of heavy hitters being tracked

Note

Not settable

Type:int
remove_alt(hashes, num_els=1)[source]

Remove element based on hashes provided; not supported in heavy hitters

Raises:NotSupportedError – This function is not supported by the HeavyHitters class

Note

Override function

For more information of all methods and properties, see CountMinSketch.

StreamThreshold

class probables.StreamThreshold(threshold=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]

keep track of those elements over a certain threshold

add(key, num_els=1)[source]

Add the element for key into the data structure

Parameters:
  • key (str) – The element to add
  • num_els (int) – The number of instances to add
Returns:

Number of times key has been inserted

Return type:

int

Note

Override function

add_alt(key, hashes, num_els=1)[source]

Add the element for key into the data structure

Parameters:
  • key (str) – The element to add
  • hashes (list) – The list of integers representing the key to insert
  • num_els (int) – The number of instances to add
Returns:

Number of times key has been inserted

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

clear()[source]

Clear out the stream threshold!

meets_threshold

Those keys that meet the required threshold (with value)

Type:dict
remove(key, num_els=1)[source]

Remove element ‘key’ from the count-min sketch

Parameters:
  • key (str) – The element to remove
  • num_els (int) – The number of times to remove the element
Returns:

The number of times the element was likely inserted after the removal

Return type:

int

Note

Override function

remove_alt(key, hashes, num_els=1)[source]

Remove an element by using the hash representation

Parameters:
  • key (str) – The key that the hashes represent
  • hashes (list) – The hashes representing the element to remove
  • num_els (int) – The number of times to remove the element
Returns:

The number of times the element was likely inserted after the removal

Return type:

int

Note

Different key signature than the normal CountMinSketch

Note

Override function

threshold

The threshold at which a key is tracked

Type:int

For more information of all methods and properties, see CountMinSketch.

Exceptions

PyProbables Exceptions

exception probables.exceptions.CuckooFilterFullError(message)[source]

Cuckoo Filter Full Exception

Parameters:message (str) – The error message to be reported
exception probables.exceptions.InitializationError(message)[source]

Initialization Exception

Parameters:message (str) – The initialization error messge
exception probables.exceptions.NotSupportedError(message)[source]

Not Supported Functionality Exception

Parameters:message (str) – The error message to be reported
exception probables.exceptions.ProbablesBaseException(message)[source]

Base ProbablesBaseException

Parameters:message (str) – The error message to be reported
exception probables.exceptions.RotatingBloomFilterError(message)[source]

RotatingBloomFilter unable to rotate Blooms Exceptions

Parameters:message (str) – The error message to be reported

Hashing Functions

Probables Hashing library

probables.hashes.default_fnv_1a(key, depth=1)[source]

The default fnv-1a hashing routine

Parameters:
  • key (str) – The element to be hashed
  • depth (int) – The number of hash permutations to compute
Returns:

List of size depth hashes

Return type:

list(int)

probables.hashes.default_md5(key, depth=1)[source]

The default md5 hashing routine

Parameters:
  • key (str) – The element to be hashed
  • depth (int) – The number of hash permutations to compute
Returns:

List of 64-bit hashed representation of key hashes

Return type:

list(int)

Note

Returns the upper-most 64 bits

probables.hashes.default_sha256(key, depth=1)[source]

The default sha256 hashing routine

Parameters:
  • key (str) – The element to be hashed
  • depth (int) – The number of hash permutations to compute
Returns:

List of 64-bit hashed representation of key hashes

Return type:

list(int)

Note

Returns the upper-most 64 bits

probables.hashes.fnv_1a(key)[source]

Pure python implementation of the 64 bit fnv-1a hash

Parameters:key (str) – The element to be hashed
Returns:64-bit hashed representation of key
Return type:int

Note

Uses the lower 64 bits when overflows occur

probables.hashes.hash_with_depth_bytes(func)[source]

Decorator to turns a function taking a single key and hashes it to bytes. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.

Parameters:
  • key (str) – The element to be hashed
  • depth (int) – The number of hash permutations to compute
Returns:

64-bit hashed representation of key

Return type:

list(int)

Note

Arguments shown are as it will be after decorated

probables.hashes.hash_with_depth_int(func)[source]

Decorator to turn a function that takes a single key and hashes it to an int. Wraps functions to be used in Bloom filters and Count-Min sketch data structures.

Parameters:
  • key (str) – The element to be hashed
  • depth (int) – The number of hash permutations to compute
Returns:

64-bit hashed representation of key

Return type:

list(int)

Note

Arguments shown are as it will be after decorated