pyjedai.utils

Functions

add_entry(workflow, dataframe_dictionary)

Retrieves features and their values from the given workflow dictionary,

are_matching(entity_index, id1, id2)

id1 and id2 form a matching pair if:
- Blocks: intersection > 0 (comparison of sets)
- Clusters: cluster-id-j == cluster-id-i (comparison of integers)
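
The two cases can be illustrated with a minimal sketch. It is not pyjedai's implementation: the name `are_matching_sketch` is hypothetical, and it assumes the entity index maps each id either to a set of block keys or to an integer cluster id.

```python
def are_matching_sketch(entity_index, id1, id2):
    """Hypothetical stand-in for are_matching; the index layout is an assumption."""
    # An id that is missing from the index cannot be part of a matching pair.
    if id1 not in entity_index or id2 not in entity_index:
        return False
    first, second = entity_index[id1], entity_index[id2]
    if isinstance(first, set):
        # Blocks: the two entities share at least one block key.
        return len(first & second) > 0
    # Clusters: both entities carry the same integer cluster id.
    return first == second


block_index = {0: {"smith", "john"}, 1: {"smith"}, 2: {"doe"}}
cluster_index = {0: 3, 1: 3, 2: 7}

print(are_matching_sketch(block_index, 0, 1))
print(are_matching_sketch(cluster_index, 0, 2))
```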

batch_pairs(iterable[, batch_size])

Generator function that breaks an iterable into batches of a set size.
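
A generator of this kind might look like the sketch below. The name `batch_pairs_sketch` and the default `batch_size=2` are assumptions made for illustration; the library's actual default and return type are not stated here.

```python
from itertools import islice


def batch_pairs_sketch(iterable, batch_size=2):
    """Hypothetical stand-in for batch_pairs; the default size is an assumption."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # Pull at most batch_size items; the last batch may be shorter.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


pairs = [(0, 1), (0, 2), (1, 2)]
for batch in batch_pairs_sketch(pairs, batch_size=2):
    print(batch)
```

Because the batches are produced lazily, the full iterable never has to be held in memory at once.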

block_with_one_entity(block, is_dirty_er)

Checks for single-entity blocks.

canonical_swap(id1, id2)

Returns the identifiers in canonical order

chi_square(in_array)

Chi Square Method

clear_json_file(path)

common_elements(elements1, elements2)

Returns the elements common to both lists (their intersection), in the order they appear in the first list
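
Going by the function name, one plausible reading is that only elements present in both lists are kept, in the order of the first list. The sketch below follows that reading; `common_elements_sketch` is a hypothetical name, and it assumes the elements are hashable.

```python
def common_elements_sketch(elements1, elements2):
    """Hypothetical stand-in for common_elements; assumes hashable elements."""
    # A set gives constant-time membership checks against the second list.
    lookup = set(elements2)
    # Iterating over the first list preserves its ordering in the result.
    return [element for element in elements1 if element in lookup]


print(common_elements_sketch([3, 1, 2, 5], [2, 3, 9]))
```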

cosine(x, y)

Cosine similarity between two vectors
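
Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The sketch below uses only the standard library; `cosine_sketch` is a hypothetical name, and returning 0.0 for a zero vector is an assumption, not documented behaviour of the library function.

```python
import math


def cosine_sketch(x, y):
    """Hypothetical stand-in for cosine; zero-vector handling is an assumption."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # The similarity is undefined for a zero vector; this sketch returns 0.0.
        return 0.0
    return dot / (norm_x * norm_y)


print(cosine_sketch([1.0, 0.0], [0.0, 1.0]))
```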

create_entity_index(blocks, is_dirty_er)

Creates a dict mapping entity ids to block keys.

drop_big_blocks_by_size(blocks, ...)

Drops blocks if:

drop_single_entity_blocks(blocks, is_dirty_er)

Removes single-entity blocks for Dirty ER (DER) and empty blocks for Clean-Clean ER (CCER)

generate_unique_identifier()

Returns a unique identifier used to cross-reference workflows stored in a JSON file with their performance graphs

get_blocks_cardinality(blocks, is_dirty_er)

Returns the cardinality of the blocks.

get_class_function_arguments(...)

Returns a list of argument names for the requested function of the given class. Parameters: class_reference, a reference to a class; function_name (str), the name of the requested function.

get_multiples(num, n)

Returns a list of multiples of the requested number up to n * number

get_ngrams(text, n)

get_qgram_from_tokenizer_name(tokenizer)

Returns the q-gram value from the tokenizer name.

get_reverse_indexing_id(id, data)

get_sorted_blocks_shuffled_entities(...)

Sorts blocks in alphabetical order based on their token, shuffles the entities of each block, and concatenates the results into a list

has_duplicate_pairs(pairs)

is_infinite(value)

java_math_round(value)

matching_arguments(workflow, arguments)

Checks whether the workflow's arguments that are shared with the target arguments have values that appear in those target arguments

necessary_dfs_supplied(configuration)

Checks whether the configuration file contains values for the source, target and ground-truth dataframes

new_dictionary_from_keys(dictionary, keys)

Returns a subset of the given dictionary including only the given keys.
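
Taking a subset of a dictionary by key can be sketched as follows. The name `new_dictionary_from_keys_sketch` is hypothetical, and silently skipping keys that are absent from the input is an assumption; the library function may behave differently in that case.

```python
def new_dictionary_from_keys_sketch(dictionary, keys):
    """Hypothetical stand-in for new_dictionary_from_keys."""
    # Assumption: keys missing from the input dictionary are skipped.
    return {key: dictionary[key] for key in keys if key in dictionary}


workflow = {"name": "w1", "f1": 0.82, "recall": 0.9}
print(new_dictionary_from_keys_sketch(workflow, ["name", "f1"]))
```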

print_blocks(blocks, is_dirty_er)

Prints all the contents of the block index.

print_candidate_pairs(blocks)

Prints the candidate pairs index in natural language.

print_clusters(clusters)

Prints the contents of the clusters.

purge_id_column(columns)

read_data_from_json(json_path, base_dir[, ...])

Reads dataset details from a JSON file and returns a Data object.

retrieve_top_workflows([workflows, ...])

Takes a workflow dictionary or retrieves it from given path.

reverse_blocks_entity_indexing(blocks, data)

Returns a new instance of blocks containing the entity IDs of the given blocks translated into the reverse indexing system. Parameters: blocks (dict), blocks as defined in the previous indexing; data (Data), the previous data module, used to define the reversed ids based on the previous dataset limit and dataset sizes.

reverse_data_indexing(data)

Returns a new data model based on the given data model, with reversed indexing of the datasets. Parameters: data (Data), the input data model.

reverse_prunned_blocks_entity_indexing(...)

reverse_raw_blocks_entity_indexing(blocks, data)

sorted_enumerate(seq[, reverse])

text_cleaning_method(col)

Lower clean.

to_path(path)

update_top_results(results, new_workflow, ...)

Based on its performance, sets the new workflow as the top one in

values_given(configuration, parameter)

Checks whether values for the requested parameter have been supplied by the user in the configuration file

workflows_to_dataframe([workflows, ...])

Takes a workflow dictionary or retrieves it from given path.

Classes

DatasetScheduler([budget, entity_ids, ...])

Stores a dictionary [Entity -> Entity's Neighborhood Data (Whoosh Neighborhood)]

EntityScheduler(id)

Stores information about the neighborhood of a given entity ID:
- ID: The identifier of the entity as it is defined within the original dataframe
- Total Weight: The total weight of the entity's neighbors
- Number of Neighbors: The total number of neighbors
- Neighbors: The entity's neighbors sorted in descending order of weight
- Stage: Insert / Pop stage (entities stored in ascending / descending weight order)

FrequencyEvaluator(vectorizer, tokenizer, qgram)

PositionIndex(num_of_entities, sorted_entities)

For each entity identifier, stores the list of indices at which it appears within the list of shuffled entities of the sorted blocks

PredictionData(matcher, matcher_info)

Auxiliary module used to store basic information about the predicted pairs that are to be emitted

SubsetIndexer(blocks, data, subset)

Stores the indices of the retained entities of the initial datasets, and calculates and stores the mapping of element indices from the new dataset to the old one (id in subset -> id in original)

Tokenizer()

WordQgramTokenizer([q])