Similarity Joins Tutorial#

In this notebook we present the pyJedAI approach in the well-known ABT-BUY dataset using a Similarity Join workflow.

workflow2-cora.png

How to install?#

pyJedAI is an open-source library that can be installed from PyPI.

For more: pypi.org/project/pyjedai/

!python --version
Python 3.8.17
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page: 
Author: 
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by: 

Imports

import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph

from pyjedai.utils import print_clusters, print_blocks, print_candidate_pairs
from pyjedai.evaluation import Evaluation
[nltk_data] Downloading package stopwords to /home/jm/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Reading the dataset#

pyJedAI in order to perfrom needs only the tranformation of the initial data into a pandas DataFrame. Hence, pyJedAI can function in every structured or semi-structured data. In this case Abt-Buy dataset is provided as .csv files.

pyjedai <Data> module#

Data module offers a numpber of options

  • Selecting the parameters (columns) of the dataframe, in D1 (and in D2)

  • Prints a detailed text analysis

  • Stores a hidden mapping of the ids, and creates it if not exists.

from pyjedai.datamodel import Data

d1 = pd.read_csv("./../data/der/cora/cora.csv", sep='|')
gt = pd.read_csv("./../data/der/cora/cora_gt.csv", sep='|', header=None)
attr = ['Entity Id','author', 'title']

Data is the connecting module of all steps of the workflow

data = Data(
    dataset_1=d1,
    id_column_name_1='Entity Id',
    ground_truth=gt,
    attributes_1=attr
)

Similarity Joins#

Available algorithms:

  • EJoin

  • TopKJoin

from pyjedai.joins import EJoin, TopKJoin
/home/jm/public-pyJedAI/pyJedAI/src/pyjedai/joins.py:13: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm
join = EJoin(similarity_threshold = 0.5,
             metric = 'jaccard',
             tokenization = 'qgrams_multiset',
             qgrams = 2)

g = join.fit(data)
_ = join.evaluate(g)
***************************************************************************************************************************
                                         Μethod:  EJoin
***************************************************************************************************************************
Method name: EJoin
Parameters: 
	similarity_threshold: 0.5
	metric: jaccard
	tokenization: qgrams_multiset
	qgrams: 2
Runtime: 51.6994 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:     65.80% 
	Recall:        93.03%
	F1-score:      77.08%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
draw(g)
../_images/84f935264e71766885056690bdeff7fd0fc09ff670ec3b58df135abf19f09cd9.png

Entity Clustering#

It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.

from pyjedai.clustering import ConnectedComponentsClustering
ec = ConnectedComponentsClustering()
clusters = ec.process(g, data, similarity_threshold=0.3)
_ = ec.evaluate(clusters)
***************************************************************************************************************************
                                         Μethod:  Connected Components Clustering
***************************************************************************************************************************
Method name: Connected Components Clustering
Parameters: 
Runtime: 0.3853 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:     48.42% 
	Recall:        93.19%
	F1-score:      63.73%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis