pyprobables API¶
Here you can find the full developer API for the pyprobables project. pyprobables provides a suite of probabilistic datastructures to be used in data analytics and data science projects.
Data Structures and Classes¶
Bloom Filters¶
Bloom Filters are a class of probabilistic data structures used for set operations. Bloom Filters guarantee a zero percent false negative rate and a predetermined false positive rate. Once the number of elements inserted exceeds the estimated elements, the false positive rate will increase over the desired amount. Further Reading
BloomFilter¶

class
probables.
BloomFilter
(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]¶ Simple Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)
Parameters:  est_elements (int) – The number of estimated elements to be added
 false_positive_rate (float) – The desired false positive rate
 filepath (str) – Path to file to load
 hex_string (str) – Hex based representation to be loaded
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: A Bloom Filter object
Return type: Note
 Initialization order of operations:
 From file
 From Hex String
 From params

add
(key)¶ Add the key to the Bloom Filter
Parameters: key (str) – The element to be inserted

add_alt
(hashes)¶ Add the element represented by hashes into the Bloom Filter
Parameters: hashes (list) – A list of integers representing the key to insert

bloom
¶ The bit/int array
Type: list(int)

bloom_length
¶ Length of the Bloom Filter array
Note
Not settable
Type: int

check
(key)¶ Check if the key is likely in the Bloom Filter
Parameters: key (str) – The element to be checked Returns: True if likely encountered, False if definately not Return type: bool

check_alt
(hashes)¶ Check if the element represented by hashes is in the Bloom Filter
Parameters: hashes (list) – A list of integers representing the key to check Returns: True if likely encountered, False if definately not Return type: bool

clear
()¶ Clear or reset the Counting Bloom Filter

current_false_positive_rate
()¶ Calculate the current false positive rate based on elements added
Returns: The current false positive rate Return type: float

elements_added
¶ Number of elements added to the Bloom Filter
Note
Changing this can cause the current false positive rate to be reported incorrectly
Type: int

estimate_elements
()¶ Estimate the number of unique elements added
Returns: Number of elements estimated to be inserted Return type: int

estimated_elements
¶ The maximum number of elements estimated to be added at setup
Note
Not settable
Type: int

export
(filename)¶ Export the Bloom Filter to disk
Parameters: filename (str) – The filename to which the Bloom Filter will be written.

export_hex
()¶ Export the Bloom Filter as a hex string
Returns: Hex representation of the Bloom Filter Return type: str

export_size
()¶ Calculate the size of the bloom on disk
Returns: Size of the Bloom Filter when exported to disk Return type: int

false_positive_rate
¶ The maximum desired false positive rate
Note
Not settable
Type: float

hash_function
¶ The hash function used
Note
Not settable
Type: function

hashes
(key, depth=None)¶ Return the hashes based on the provided key
Parameters:  key (str) – Description of arg1
 depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
Returns: A list of the hashes for the key in int form
Return type: List(int)

intersection
(second)[source]¶ Return a new Bloom Filter that contains the intersection of the two
Parameters: second (BloomFilter) – The Bloom Filter with which to take the intersection Returns: The new Bloom Filter containing the intersection Return type: BloomFilter Raises: TypeError
– When second is not either aBloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilterOnDisk object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None

is_on_disk
¶ Is the Bloom Filter on Disk or not
Note
Not settable
Type: bool

jaccard_index
(second)[source]¶ Calculate the jaccard similarity score between two Bloom Filters
Parameters: second (BloomFilter) – The Bloom Filter to compare with Returns: A numeric value between 0 and 1 where 1 is identical and 0 means completely different Return type: float Raises: TypeError
– When second is not either aBloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilterOnDisk object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None

number_bits
¶ Number of bits in the Bloom Filter
Note
Not settable
Type: int

number_hashes
¶ The number of hashes required for the Bloom Filter hashing strategy
Note
Not settable
Type: int

union
(second)[source]¶ Return a new Bloom Filter that contains the union of the two
Parameters: second (BloomFilter) – The Bloom Filter with which to calculate the union Returns: The new Bloom Filter containing the union Return type: BloomFilter Raises: TypeError
– When second is not either aBloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilterOnDisk object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
BloomFilterOnDisk¶

class
probables.
BloomFilterOnDisk
(filepath, est_elements=None, false_positive_rate=None, hex_string=None, hash_function=None)[source]¶ Simple Bloom Filter implementation directly on disk for use in python; It can read and write the same format as the c version (https://github.com/barrust/bloom)
Parameters:  filepath (str) – Path to file to load
 est_elements (int) – The number of estimated elements to be added
 false_positive_rate (float) – The desired false positive rate
 hex_string (str) – Hex based representation to be loaded
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: A Bloom Filter object
Return type: Raises: NotSupportedError
– Loading using a hex string is not supportedNote
 Initialization order of operations:
 Esimated elements and false positive rate
 From Hex String
 Only filepath provided

add_alt
(hashes)[source]¶ Add the element represented by hashes into the Bloom Filter
Parameters: hashes (list) – A list of integers representing the key to insert

export
(filename)[source]¶ Export to disk if a different location
Parameters: filename (str) – The filename to which the Bloom Filter will be exported Note
Only exported if the filename is not the original filename

export_hex
()[source]¶ Export to a hex string
Raises: NotSupportedError
– This functionality is currently not supported

intersection
(second)[source]¶ Return a new Bloom Filter that contains the intersection of the two
Parameters: second (BloomFilter) – The Bloom Filter with which to take the intersection Returns: The new Bloom Filter containing the intersection Return type: BloomFilter Raises: TypeError
– When second is not either aBloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilter object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None

jaccard_index
(second)[source]¶ Calculate the jaccard similarity score between two Bloom Filters
Parameters: second (BloomFilter) – The Bloom Filter to compare with Returns: A numeric value between 0 and 1 where 1 is identical and 0 means completely different Return type: float Raises: TypeError
– When second is not either aBloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilter object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None

union
(second)[source]¶ Return a new Bloom Filter that contains the union of the two
Parameters: second (BloomFilter) – The Bloom Filter with which to calculate the union Returns: The new Bloom Filter containing the union Return type: BloomFilter Raises: TypeError
– When second is not either aBloomFilter
orBloomFilterOnDisk
Note
second may be a BloomFilter object
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
For more information of all methods and properties, see BloomFilter.
ExpandingBloomFilter¶

class
probables.
ExpandingBloomFilter
(est_elements=None, false_positive_rate=None, filepath=None, hash_function=None)[source]¶ Simple expanding Bloom Filter implementation for use in python; the Bloom Fiter will automatically expand, or grow, if the false positive rate is about to become greater than the desired false positive rate.
Parameters:  est_elements (int) – The number of estimated elements to be added
 false_positive_rate (float) – The desired false positive rate
 filepath (str) – Path to file to load
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: An expanding Bloom Filter object
Return type: Note
 Initialization order of operations:
 Filepath
 est_elements and false_positive_rate

add
(key, force=False)[source]¶ Add the key to the Bloom Filter
Parameters:  key (str) – The element to be inserted
 force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

add_alt
(hashes, force=False)[source]¶ Add the element represented by hashes into the Bloom Filter
Parameters:  hashes (list) – A list of integers representing the key to insert
 force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

check
(key)[source]¶ Check to see if the key is in the Bloom Filter
Parameters: key (str) – The key to check for in the Bloom Filter Returns: True if the element is likely present; False if definately not present Return type: bool

check_alt
(hashes)[source]¶ Check to see if the hashes are in the Bloom Filter
Parameters: hashes (list) – The hash representation to check for in the Bloom Filter Returns: True if the element is likely present; False if definately not present Return type: bool

elements_added
¶ The total number of elements added
Type: int

estimated_elements
¶ The original number of elements estimated to be in the Bloom Filter
Type: int

expansions
¶ The number of expansions
Type: int

export
(filepath)[source]¶ Export an expanding Bloom Filter, or subclass, to disk
Parameters: filepath (str) – The path to the file to import

false_positive_rate
¶ The desired false positive rate of the expanding Bloom Filter
Type: float
RotatingBloomFilter¶

class
probables.
RotatingBloomFilter
(est_elements=None, false_positive_rate=None, max_queue_size=10, filepath=None, hash_function=None)[source]¶ Simple Rotating Bloom Filter implementation that allows for the “older” elements added to be removed, in chunks. As the queue fills up, those elements inserted earlier will be bulk removed. This also provides the user with the oportunity to force the removal instead of it being time based.
Parameters:  est_elements (int) – The number of estimated elements to be added
 false_positive_rate (float) – The desired false positive rate
 max_queue_size (int) – This is the number is used to determine the maximum number of Bloom Filters. Total elements added is based on max_queue_size * est_elements
 filepath (str) – Path to file to load
 hash_function (function) – Hashing strategy function to use hf(key, number)
Note
 Initialization order of operations:
 Filepath
 est_elements and false_positive_rate

add
(key, force=False)¶ Add the key to the Bloom Filter
Parameters:  key (str) – The element to be inserted
 force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

add_alt
(hashes, force=False)[source]¶ Add the element represented by hashes into the Bloom Filter
Parameters:  hashes (list) – A list of integers representing the key to insert
 force (bool) – True will force it to be inserted, even if it likely has been inserted before False will only insert if not found in the Bloom Filter

check
(key)¶ Check to see if the key is in the Bloom Filter
Parameters: key (str) – The key to check for in the Bloom Filter Returns: True if the element is likely present; False if definately not present Return type: bool

check_alt
(hashes)¶ Check to see if the hashes are in the Bloom Filter
Parameters: hashes (list) – The hash representation to check for in the Bloom Filter Returns: True if the element is likely present; False if definately not present Return type: bool

current_queue_size
¶ The current size of the queue
Type: int

elements_added
¶ The total number of elements added
Type: int

estimated_elements
¶ The original number of elements estimated to be in the Bloom Filter
Type: int

expansions
¶ The number of expansions
Type: int

export
(filepath)¶ Export an expanding Bloom Filter, or subclass, to disk
Parameters: filepath (str) – The path to the file to import

false_positive_rate
¶ The desired false positive rate of the expanding Bloom Filter
Type: float

max_queue_size
¶ The maximum size for the queue
Type: int
CountingBloomFilter¶

class
probables.
CountingBloomFilter
(est_elements=None, false_positive_rate=None, filepath=None, hex_string=None, hash_function=None)[source]¶ Simple Counting Bloom Filter implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/counting_bloom)
Parameters:  est_elements (int) – The number of estimated elements to be added
 false_positive_rate (float) – The desired false positive rate
 filepath (str) – Path to file to load
 hex_string (str) – Hex based representation to be loaded
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: A Counting Bloom Filter object
Return type: Note
 Initialization order of operations:
 From file
 From Hex String
 From params

add
(key, num_els=1)[source]¶ Add the key to the Counting Bloom Filter
Parameters:  key (str) – The element to be inserted
 num_els (int) – Number of times to insert the element
Returns: Maximum number of insertions
Return type: int

add_alt
(hashes, num_els=1)[source]¶ Add the element represented by hashes into the Counting Bloom Filter
Parameters:  hashes (list) – A list of integers representing the key to insert
 num_els (int) – Number of times to insert the element
Returns: Maximum number of insertions
Return type: int

bloom
¶ The bit/int array
Type: list(int)

bloom_length
¶ Length of the Bloom Filter array
Note
Not settable
Type: int

check
(key)[source]¶ Check if the key is likely in the Counting Bloom Filter
Parameters: key (str) – The element to be checked Returns: Maximum number of insertions Return type: int

check_alt
(hashes)[source]¶ Check if the element represented by hashes is in the Counting Bloom Filter
Parameters: hashes (list) – A list of integers representing the key to check Returns: Maximum number of insertions Return type: int

clear
()¶ Clear or reset the Counting Bloom Filter

current_false_positive_rate
()¶ Calculate the current false positive rate based on elements added
Returns: The current false positive rate Return type: float

elements_added
¶ Number of elements added to the Bloom Filter
Note
Changing this can cause the current false positive rate to be reported incorrectly
Type: int

estimate_elements
()¶ Estimate the number of unique elements added
Returns: Number of elements estimated to be inserted Return type: int

estimated_elements
¶ The maximum number of elements estimated to be added at setup
Note
Not settable
Type: int

export
(filename)¶ Export the Bloom Filter to disk
Parameters: filename (str) – The filename to which the Bloom Filter will be written.

export_hex
()¶ Export the Bloom Filter as a hex string
Returns: Hex representation of the Bloom Filter Return type: str

export_size
()¶ Calculate the size of the bloom on disk
Returns: Size of the Bloom Filter when exported to disk Return type: int

false_positive_rate
¶ The maximum desired false positive rate
Note
Not settable
Type: float

hash_function
¶ The hash function used
Note
Not settable
Type: function

hashes
(key, depth=None)¶ Return the hashes based on the provided key
Parameters:  key (str) – Description of arg1
 depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
Returns: A list of the hashes for the key in int form
Return type: List(int)

intersection
(second)[source]¶ Take the intersection of two Counting Bloom Filters
Parameters: second (CountingBloomFilter) – The Bloom Filter with which to take the intersection Returns: The new Counting Bloom Filter containing the union Return type: CountingBloomFilter Raises: TypeError
– When second is not aCountingBloomFilter
Note
The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None

is_on_disk
¶ Is the Bloom Filter on Disk or not
Note
Not settable
Type: bool

jaccard_index
(second)[source]¶ Take the Jaccard Index of two Counting Bloom Filters
Parameters: second (CountingBloomFilter) – The Bloom Filter with which to take the jaccard index Returns: A numeric value between 0 and 1 where 1 is identical and 0 means completely different Return type: float Raises: TypeError
– When second is not aCountingBloomFilter
Note
The Jaccard Index is based on the unique set of elements added and not the number of each element added
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None

number_bits
¶ Number of bits in the Bloom Filter
Note
Not settable
Type: int

number_hashes
¶ The number of hashes required for the Bloom Filter hashing strategy
Note
Not settable
Type: int

remove
(key, num_els=1)[source]¶ Remove the element from the counting bloom
Parameters:  key (str) – The element to be removed
 num_els (int) – Number of times to remove the element
Returns: Maximum number of insertions after the removal
Return type: int

remove_alt
(hashes, num_els=1)[source]¶ Remvoe the element represented by hashes from the Counting Bloom Filter
Parameters:  hashes (list) – A list of integers representing the key to remove
 num_els (int) – Number of times to remove the element
Returns: Maximum number of insertions after the removal
Return type: int

union
(second)[source]¶ Return a new Countiong Bloom Filter that contains the union of the two
Parameters: second (CountingBloomFilter) – The Counting Bloom Filter with which to calculate the union Returns: The new Counting Bloom Filter containing the union Return type: CountingBloomFilter Raises: TypeError
– When second is not aCountingBloomFilter
Note
The elements_added property will be set to the estimated number of unique elements added as found in estimate_elements()
Note
If second is not of the same size (false_positive_rate and est_elements) then this will return None
Cuckoo Filters¶
Cuckoo filters are a space efficient data structure that supports set membership testing. Cuckoo filters support insertion, deletion, and lookup of elements with low overhead and few false positive results. The name is derived from the cuckoo hashing strategy used to resolve conflicts. Further Reading
CuckooFilter¶

class
probables.
CuckooFilter
(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]¶ Simple Cuckoo Filter implementation
Parameters:  capacity (int) – The number of bins
 bucket_size (int) – The number of buckets per bin
 max_swaps (int) – The number of cuckoo swaps before stopping
 expansion_rate (int) – The rate at which to expand
 auto_expand (bool) – If the filter should automatically expand
 finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
 filename (str) – The path to the file to load or None if no file
 hash_function (function) – Hashing strategy function to use hf(key)
Returns: A Cuckoo Filter object
Return type: 
add
(key)[source]¶ Add element key to the filter
Parameters: key (str) – The element to add Raises: CuckooFilterFullError
– When element not inserted after maximum number of swaps or ‘kicks’

auto_expand
¶ True if the cuckoo filter will expand automatically
Type: bool

bucket_size
¶ The number of buckets per bin
Note
Not settable
Type: int

buckets
¶ The buckets holding the fingerprints
Note
Not settable
Type: list(list)

capacity
¶ The number of bins
Note
Not settable
Type: int

check
(key)[source]¶ Check if an element is in the filter
Parameters: key (str) – Element to check Returns: True if likely present, False if definately not Return type: bool

elements_added
¶ The number of elements added
Note
Not settable
Type: int

expansion_rate
¶ The rate at expansion when the filter grows
Type: int

export
(filename)[source]¶ Export cuckoo filter to file
Parameters: filename (str) – Path to file to export

fingerprint_size
¶ The size in bytes of the fingerprint
Raises: ValueError
– If the size is not between 1 and 4Note
The size of the fingerprint must be between 1 and 4
Type: int

max_swaps
¶ The maximum number of swaps to perform
Note
Not settable
Type: int
CountingCuckooFilter¶

class
probables.
CountingCuckooFilter
(capacity=10000, bucket_size=4, max_swaps=500, expansion_rate=2, auto_expand=True, finger_size=4, filepath=None, hash_function=None)[source]¶ Simple Counting Cuckoo Filter implementation
Parameters:  capacity (int) – The number of bins
 bucket_size (int) – The number of buckets per bin
 max_swaps (int) – The number of cuckoo swaps before stopping
 expansion_rate (int) – The rate at which to expand
 auto_expand (bool) – If the filter should automatically expand
 finger_size (int) – The size of the fingerprint to use in bytes (between 1 and 4); exported as 4 bytes; up to the user to reset the size correctly on import
 filename (str) – The path to the file to load or None if no file
Returns: A Cuckoo Filter object
Return type: 
add
(key)[source]¶ Add element key to the filter
Parameters: key (str) – The element to add Raises: CuckooFilterFullError
– When element not inserted after maximum number of swaps or ‘kicks’

auto_expand
¶ True if the cuckoo filter will expand automatically
Type: bool

bucket_size
¶ The number of buckets per bin
Note
Not settable
Type: int

buckets
¶ The buckets holding the fingerprints
Note
Not settable
Type: list(list)

capacity
¶ The number of bins
Note
Not settable
Type: int

check
(key)[source]¶ Check if an element is in the filter
Parameters: key (str) – Element to check Returns: The number of times inserted into the filter Return type: int

elements_added
¶ The number of elements added
Note
Not settable
Type: int

expansion_rate
¶ The rate at expansion when the filter grows
Type: int

export
(filename)[source]¶ Export cuckoo filter to file
Parameters: filename (str) – Path to file to export

fingerprint_size
¶ The size in bytes of the fingerprint
Raises: ValueError
– If the size is not between 1 and 4Note
The size of the fingerprint must be between 1 and 4
Type: int

max_swaps
¶ The maximum number of swaps to perform
Note
Not settable
Type: int

unique_elements
¶ unique number of elements inserted
Type: int
CountMin Sketches¶
CountMin Sketches, and its derivatives, are good for estimating the number of occurrences of an element in streaming data while not needing to retain all the data elements. The result is a probabilistic count of elements inserted into the data structure. It will always provide the maximum number of times a data element was encountered. Notice that the result may be more than the true number of times it was inserted, but never fewer. Further Reading
CountMinSketch¶

class
probables.
CountMinSketch
(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]¶ Simple CountMin Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/countminsketch)
Parameters:  width (int) – The width of the countmin sketch
 depth (int) – The depth of the countmin sketch
 confidence (float) – The level of confidence desired
 error_rate (float) – The desired error rate
 filepath (str) – Path to file to load
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: A CountMin Sketch object
Return type: Note
 Initialization order of operations:
 From file
 Width and depth
 Confidence and error rate
Note
Default query type is min
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add
(key, num_els=1)[source]¶ Insert the element key into the countmin sketch
Parameters:  key (str) – The element to insert
 num_els (int) – The number of times to insert the element
Returns: The number of times the element was likely inserted after the insertion
Return type: int

add_alt
(hashes, num_els=1)[source]¶ Insert an element by using the hash representation
Parameters:  key (str) – The element to insert
 num_els (int) – The number of times to insert the element
Returns: The number of times the element was likely inserted after the insertion
Return type: int

check
(key)[source]¶ Check number of times element ‘key’ is in the countmin sketch
Parameters: key (str) – The key to check the number of times inserted Returns: The number of times the element was likely inserted Return type: int

check_alt
(hashes)[source]¶ Check the countmin sketch for an element by using the hash representation
Parameters: hashes (list) – The hashes representing the element to check Returns: The number of times the element was likely inserted Return type: int

confidence
¶ The confidence of the countmin sketch
Note
Not settable
Type: float

depth
¶ The depth of the countmin sketch
Note
Not settable
Type: int

elements_added
¶ The number of elements added to the countmin sketch
Note
Not settable
Type: int

error_rate
¶ The error rate of the countmin sketch
Note
Not settable
Type: float

export
(filepath)[source]¶ Export the countmin sketch to disk
Parameters: filename (str) – The filename to which the countmin sketch will be written.

hashes
(key, depth=None)[source]¶ Return the hashes based on the provided key
Parameters:  key (str) – Description of arg1
 depth (int) – Number of permutations of the hash to generate; if None, generate number_hashes
Returns: A list of the hashes for the key in int form
Return type: List(int)

query_type
¶ The name of the query type being used
Note
 Valid values:
 ‘min’ or None
 ‘mean’
 ‘meanmin’
Type: str

remove
(key, num_els=1)[source]¶ Remove element ‘key’ from the countmin sketch
Parameters:  key (str) – The element to remove
 num_els (int) – The number of times to remove the element
Returns: The number of times the element was likely inserted after the removal
Return type: int

remove_alt
(hashes, num_els=1)[source]¶ Remove an element by using the hash representation
Parameters:  hashes (list) – The hashes representing the element to remove
 num_els (int) – The number of times to remove the element
Returns: The number of times the element was likely inserted after the removal
Return type: int

width
¶ The width of the countmin sketch
Note
Not settable
Type: int
CountMeanSketch¶

class
probables.
CountMeanSketch
(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]¶ Simple CountMean Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/countminsketch)
Parameters:  width (int) – The width of the countmin sketch
 depth (int) – The depth of the countmin sketch
 confidence (float) – The level of confidence desired
 error_rate (float) – The desired error rate
 filepath (str) – Path to file to load
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: A CountMean Sketch object
Return type: Note
 Initialization order of operations:
 From file
 Width and depth
 Confidence and error rate
Note
Default query type is mean
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens
For more information of all methods and properties, see CountMinSketch.
CountMeanMinSketch¶

class
probables.
CountMeanMinSketch
(width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]¶ Simple CountMeanMin Sketch implementation for use in python; It can read and write the same format as the c version (https://github.com/barrust/countminsketch)
Parameters:  width (int) – The width of the countmin sketch
 depth (int) – The depth of the countmin sketch
 confidence (float) – The level of confidence desired
 error_rate (float) – The desired error rate
 filepath (str) – Path to file to load
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: A CountMeanMin Sketch object
Return type: Note
 Initialization order of operations:
 From file
 Width and depth
 Confidence and error rate
Note
Default query type is meanmin
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens
For more information of all methods and properties, see CountMinSketch.
HeavyHitters¶

class
probables.
HeavyHitters
(num_hitters=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]¶ Find and track those elements that are the most common, or heavy hitters
Parameters:  num_hitters (int) – The maximum number of distinct elements to track
 width (int) – The width of the countmin sketch
 depth (int) – The depth of the countmin sketch
 confidence (float) – The level of confidence desired
 error_rate (float) – The desired error rate
 filepath (str) – Path to file to load
 hash_function (function) – Hashing strategy function to use hf(key, number)
Returns: A CountMin Sketch object
Return type: Note
 Initialization order of operations:
 From file
 Width and depth
 Confidence and error rate
Note
Default query type is min
Note
For width and depth, width may realistically be in the thousands while depth is in the single digit to teens

add
(key, num_els=1)[source]¶ Add element to heavy hitters
Parameters:  key (str) – The element to add
 num_els (int) – The number of instances to add
Returns: Number of times key has been inserted
Return type: int
Note
Override function

add_alt
(key, hashes, num_els=1)[source]¶ Add the element key represented as hashes to the HeavyHitters object (hence the different signature on the function!)
Parameters:  key (str) – The element to add
 hashes (list) – The list of integers representing the key to insert
 num_els (int) – The number of instances to add
Returns: Number of times key has been inserted
Return type: int
Note
Different key signature than the normal
CountMinSketch
Note
Override function

heavy_hitters
¶ Return the heavy hitters, or most common elements
Note
Not settable
Type: dict

number_heavy_hitters
¶ Return the maximum number of heavy hitters being tracked
Note
Not settable
Type: int
For more information of all methods and properties, see CountMinSketch.
StreamThreshold¶

class
probables.
StreamThreshold
(threshold=100, width=None, depth=None, confidence=None, error_rate=None, filepath=None, hash_function=None)[source]¶ keep track of those elements over a certain threshold

add
(key, num_els=1)[source]¶ Add the element for key into the data structure
Parameters:  key (str) – The element to add
 num_els (int) – The number of instances to add
Returns: Number of times key has been inserted
Return type: int
Note
Override function

add_alt
(key, hashes, num_els=1)[source]¶ Add the element for key into the data structure
Parameters:  key (str) – The element to add
 hashes (list) – The list of integers representing the key to insert
 num_els (int) – The number of instances to add
Returns: Number of times key has been inserted
Return type: int
Note
Different key signature than the normal
CountMinSketch
Note
Override function

meets_threshold
¶ Those keys that meet the required threshold (with value)
Type: dict

remove
(key, num_els=1)[source]¶ Remove element ‘key’ from the countmin sketch
Parameters:  key (str) – The element to remove
 num_els (int) – The number of times to remove the element
Returns: The number of times the element was likely inserted after the removal
Return type: int
Note
Override function

remove_alt
(key, hashes, num_els=1)[source]¶ Remove an element by using the hash representation
Parameters:  key (str) – The key that the hashes represent
 hashes (list) – The hashes representing the element to remove
 num_els (int) – The number of times to remove the element
Returns: The number of times the element was likely inserted after the removal
Return type: int
Note
Different key signature than the normal
CountMinSketch
Note
Override function

threshold
¶ The threshold at which a key is tracked
Type: int

For more information of all methods and properties, see CountMinSketch.
Exceptions¶
PyProbables Exceptions

exception
probables.exceptions.
CuckooFilterFullError
(message)[source]¶ Cuckoo Filter Full Exception
Parameters: message (str) – The error message to be reported

exception
probables.exceptions.
InitializationError
(message)[source]¶ Initialization Exception
Parameters: message (str) – The initialization error messge

exception
probables.exceptions.
NotSupportedError
(message)[source]¶ Not Supported Functionality Exception
Parameters: message (str) – The error message to be reported
Hashing Functions¶
Probables Hashing library

probables.hashes.
default_fnv_1a
(key, depth=1)[source]¶ The default fnv1a hashing routine
Parameters:  key (str) – The element to be hashed
 depth (int) – The number of hash permutations to compute
Returns: List of size depth hashes
Return type: list(int)

probables.hashes.
default_md5
(key, depth=1)[source]¶ The default md5 hashing routine
Parameters:  key (str) – The element to be hashed
 depth (int) – The number of hash permutations to compute
Returns: List of 64bit hashed representation of key hashes
Return type: list(int)
Note
Returns the uppermost 64 bits

probables.hashes.
default_sha256
(key, depth=1)[source]¶ The default sha256 hashing routine
Parameters:  key (str) – The element to be hashed
 depth (int) – The number of hash permutations to compute
Returns: List of 64bit hashed representation of key hashes
Return type: list(int)
Note
Returns the uppermost 64 bits

probables.hashes.
fnv_1a
(key)[source]¶ Pure python implementation of the 64 bit fnv1a hash
Parameters: key (str) – The element to be hashed Returns: 64bit hashed representation of key Return type: int Note
Uses the lower 64 bits when overflows occur

probables.hashes.
hash_with_depth_bytes
(func)[source]¶ Decorator to turns a function taking a single key and hashes it to bytes. Wraps functions to be used in Bloom filters and CountMin sketch data structures.
Parameters:  key (str) – The element to be hashed
 depth (int) – The number of hash permutations to compute
Returns: 64bit hashed representation of key
Return type: list(int)
Note
Arguments shown are as it will be after decorated

probables.hashes.
hash_with_depth_int
(func)[source]¶ Decorator to turn a function that takes a single key and hashes it to an int. Wraps functions to be used in Bloom filters and CountMin sketch data structures.
Parameters:  key (str) – The element to be hashed
 depth (int) – The number of hash permutations to compute
Returns: 64bit hashed representation of key
Return type: list(int)
Note
Arguments shown are as it will be after decorated