pyjedai.utils

Functions

`add_entry`(workflow, dataframe_dictionary)	Retrieves features and their values from the given workflow dictionary,
`are_matching`(entity_index, id1, id2)	id1 and id2 constitute a matching pair if: Blocks: intersection > 0 (comparison of sets); Clusters: cluster-id-j == cluster-id-i (comparison of integers)
`batch_pairs`(iterable[, batch_size])	Generator function that breaks an iterable into batches of a set size.
`block_with_one_entity`(block, is_dirty_er)	Checks for one-entity blocks.
`canonical_swap`(id1, id2)	Returns the identifiers in canonical order
`chi_square`(in_array)	Chi Square Method
`clear_json_file`(path)
`common_elements`(elements1, elements2)	Returns the elements common to both lists, in the order they appear in the first list
`cosine`(x, y)	Cosine similarity between two vectors
`create_entity_index`(blocks, is_dirty_er)	Creates a dict of entity ids to block keys.
`drop_big_blocks_by_size`(blocks, ...)	Drops blocks if:
`drop_single_entity_blocks`(blocks, is_dirty_er)	Removes single-entity blocks for Dirty ER (DER) and empty blocks for Clean-Clean ER (CCER)
`generate_unique_identifier`()	Returns unique identifier which is used to cross reference workflows stored in json file and their performance graphs
`get_blocks_cardinality`(blocks, is_dirty_er)	Returns the cardinality of the blocks.
`get_class_function_arguments`(...)	Returns a list of argument names for the requested function of the given class. Parameters: class_reference (reference to a class); function_name (str, name of the requested function).
`get_multiples`(num, n)	Returns a list of multiples of the requested number up to n * number
`get_ngrams`(text, n)
`get_qgram_from_tokenizer_name`(tokenizer)	Returns the q-gram value from the tokenizer name.
`get_reverse_indexing_id`(id, data)
`get_sorted_blocks_shuffled_entities`(...)	Sorts blocks in alphabetical order based on their token, shuffles the entities of each block, concatenates the result in a list
`has_duplicate_pairs`(pairs)
`is_infinite`(value)
`java_math_round`(value)
`matching_arguments`(workflow, arguments)	Checks whether the workflow's arguments that are shared with the target arguments have values that appear in those target arguments
`necessary_dfs_supplied`(configuration)	Configuration file contains values for source, target and ground truth dataframes
`new_dictionary_from_keys`(dictionary, keys)	Returns a subset of the given dictionary including only the given keys.
`print_blocks`(blocks, is_dirty_er)	Prints all the contents of the block index.
`print_candidate_pairs`(blocks)	Prints candidate pairs index in natural language.
`print_clusters`(clusters)	Prints clusters contents.
`purge_id_column`(columns)
`read_data_from_json`(json_path, base_dir[, ...])	Reads dataset details from a JSON file and returns a Data object.
`retrieve_top_workflows`([workflows, ...])	Takes a workflow dictionary or retrieves it from given path.
`reverse_blocks_entity_indexing`(blocks, data)	Returns a new instance of blocks containing the entity IDs of the given blocks translated into the reverse indexing system. Parameters: blocks (dict, blocks as defined in the previous indexing); data (Data, previous data module used to define the reversed ids based on the previous dataset limit and dataset sizes).
`reverse_data_indexing`(data)	Returns a new data model based on the given data model, with reversed indexing of the datasets. Parameters: data (Data, input data model).
`reverse_prunned_blocks_entity_indexing`(...)
`reverse_raw_blocks_entity_indexing`(blocks, data)
`sorted_enumerate`(seq[, reverse])
`text_cleaning_method`(col)	Lower clean.
`to_path`(path)
`update_top_results`(results, new_workflow, ...)	Based on its performance, sets the new workflow as the top one in
`values_given`(configuration, parameter)	Values for requested parameters have been supplied by the user in the configuration file
`workflows_to_dataframe`([workflows, ...])	Takes a workflow dictionary or retrieves it from given path.
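
The one-line summaries for `batch_pairs`, `canonical_swap` and `are_matching` can be made concrete with a small standalone sketch. The functions below are illustrations written from those summaries only, not the library's implementations. The `_sketch` names, the default `batch_size` of 2, and the reading of "canonical order" as "smaller identifier first" are all assumptions.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple


def batch_pairs_sketch(iterable: Iterable, batch_size: int = 2) -> Iterator[list]:
    """Yield consecutive batches holding at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # iterator exhausted
            return
        yield batch


def canonical_swap_sketch(id1: int, id2: int) -> Tuple[int, int]:
    """Return the pair with the smaller identifier first (assumed canonical order)."""
    return (id1, id2) if id1 <= id2 else (id2, id1)


def are_matching_sketch(entity_index: Dict[int, object], id1: int, id2: int) -> bool:
    """Blocks: the two sets of block keys intersect. Clusters: the cluster ids are equal."""
    first, second = entity_index[id1], entity_index[id2]
    if isinstance(first, set):
        # Block-based index: entity id -> set of block keys
        return len(first & second) > 0
    # Cluster-based index: entity id -> integer cluster id
    return first == second


if __name__ == "__main__":
    print(list(batch_pairs_sketch(range(5), batch_size=2)))
    print(canonical_swap_sketch(7, 3))
    print(are_matching_sketch({1: {"a", "b"}, 2: {"b", "c"}}, 1, 2))
```

Using a generator for batching keeps memory use bounded by `batch_size`, which matters when the iterable is a large stream of candidate pairs.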

Classes

`DatasetScheduler`([budget, entity_ids, ...])	Stores a dictionary [Entity -> Entity's Neighborhood Data (Whoosh Neighborhood)]
`EntityScheduler`(id)	Stores information about the neighborhood of a given entity ID: ID (the identifier of the entity as it is defined within the original dataframe); Total Weight (the total weight of the entity's neighbors); Number of Neighbors (the total number of neighbors); Neighbors (the entity's neighbors sorted in descending order of weight); Stage (insert / pop stage, with entities stored in ascending / descending weight order)
`FrequencyEvaluator`(vectorizer, tokenizer, qgram)
`PositionIndex`(num_of_entities, sorted_entities)	For each entity identifier, stores the list of indices at which it appears within the list of shuffled entities of sorted blocks
`PredictionData`(matcher, matcher_info)	Auxiliary module used to store basic information about the to-emit, predicted pairs
`SubsetIndexer`(blocks, data, subset)	Stores the indices of retained entities of the initial datasets, calculates and stores the mapping of element indices from new to old dataset (id in subset -> id in original)
`Tokenizer`()
`WordQgramTokenizer`([q])
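
The idea summarized for `PositionIndex` (mapping each entity identifier to the positions where it occurs in a flat list of entities) can be illustrated with a short standalone sketch. The helper name `build_position_index` and its return type are hypothetical; this is a sketch of the general technique, not the class's actual interface.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_position_index(sorted_entities: Sequence[int]) -> Dict[int, List[int]]:
    """Map each entity id to the ascending list of positions where it occurs."""
    positions: Dict[int, List[int]] = defaultdict(list)
    for position, entity_id in enumerate(sorted_entities):
        positions[entity_id].append(position)
    return dict(positions)


if __name__ == "__main__":
    # An entity that belongs to several blocks appears several times in the flat list.
    print(build_position_index([3, 1, 3, 2, 1]))
```

With such an index, the neighbors of an entity within a sliding window can be found by looking up its positions directly instead of scanning the whole list.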