The Five Generations of Entity Resolution on Web Data#
Tutorial at 24th International Conference on Web Engineering (ICWE 2024)
Location: Tampere, Finland
Date: Monday 17th of June 2024, 11:00 – 12:30 @ Auditorium A4
View and/or download slides here: Download slides
Presenters#
Research Associate at University of Athens
Assistant Professor at Tilburg University Product Matching expert
Senior Researcher at University of AthensEntity Resolution expert
Abstract#
Entity Resolution constitutes a core data integration task that has attracted a bulk of works on improving its effectiveness and time efficiency. This tutorial provides a comprehensive overview of the field, distinguishing relevant methods into five main generations. The first one targets Veracity in the context of structured data with a clean schema. The second generation extends its focus to cover Volume, as well, leveraging multi-core or massive parallelization to process large-scale datasets. The third generation addresses the additional challenge of Variety, targeting voluminous, noisy, semi-structured, and highly heterogeneous data from the Semantic Web. The fourth generation also tackles Velocity so as to process data collections of a continuously increasing volume. The latest works, though, belong to the fifth generation, involving pre-trained (large) language models which heavily rely on external knowledge to address all four Vs with high effectiveness.
Programme#
Introduction and motivation, including ER preliminaries, fundamental assumptions, principles, and overview of the generations.
1st generation, focusing on Veracity, with schema matching, blocking, entity matching, and methods using external Knowledge.
2nd generation, tackling Volume and Veracity, including parallel blocking, parallel entity matching, and load balancing.
3rd generation, tackling Variety, Volume and Veracity with techniques including schema clustering, block building, block processing, entity matching, entity clustering.
4th generation, tackling Velocity, Variety, Volume and Veracity with progressive and incremental ER, as well as query-driven resolution.
5th generation, leveraging external knowledge with pre-trained LLMs pipelines, and crowdsourcing, LLMs.
Hands-on session with pyJedAI, an open-source Python package with which we will build demos from scratch over real data.
Challenges and Final Remarks, including automatic parameter configuration and future research directions.
References#
Altwaijry, H., et al.: Query: A framework for integrating entity resolution with query processing. PVLDB (2015)
Bernstein, P.A., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. PVLDB 4(11), 695–701 (2011)
B ̈ohm, C., et al.: LINDA: distributed web-of-data-scale entity matching. In: CIKM. pp. 2104–2108 (2012)
Christen, P.: Data Matching. Springer (2012)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Efthymiou, V., et al.: Self-configured Entity Resolution with pyJedAI. In: IEEE Big Data (2023)
Golshan, B., Halevy, A., Mihaila, G., Tan, W.: Data integration: After the teenage years. In: PODS. pp. 101–106 (2017)
Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental record linkage. PVLDB 7(9), 697–708 (2014)
Hassanzadeh, O., et al.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)
Ioannou, E., Garofalakis, M.: Query analytics over probabilistic databases with unmerged duplicates. TKDE (2015)
Kolb, L., Thor, A., Rahm, E.: Dedoop: Efficient deduplication with hadoop. PVLDB 5(12), 1878–1881
Kolb, L., Thor, A., Rahm, E.: Load balancing for mapreduce-based entity resolution. In: ICDE. pp. 618–629 (2012)
Lacoste-Julien, S., et al.: Sigma: simple greedy matching for aligning large knowledge bases. In: KDD. pp. 572–580 (2013)
Li, J., et al.: Rimom: A dynamic multistrategy ontology alignment framework.TKDE 21(8), 1218–1232 (2009)
Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid.In: VLDB. pp. 49–58 (2001)
Nikoletos, K., Papadakis, G., Koubarakis, M.: pyJedAI: a lightsaber for Link Discovery. In: ISWC (2022)
Papadakis, G., Ioannou, E., Palpanas, T.: Entity resolution: Past, present and yet-to-come. In: EDBT. pp. 647–650 (2020)
Papadakis, G., Ioannou, E., Thanos, E., Palpanas, T.: The Four Generations of Entity Resolution. Synthesis Lectures on Data Management, Morgan & Claypool Publishers (2021)
Stefanidis, K., Efthymiou, V., Herschel, M., Christophides, V.: Entity resolution in the web of data. In: WWW (2014)
Suchanek, F.M., et al.: PARIS: probabilistic alignment of relations, instances, and schema. PVLDB 5(3), 157–168 (2011)
Zeakis, A., Papadakis, G., Skoutas, D., Koubarakis, M.: Pre-trained Embeddings for Entity Resolution: An Experimental Analysis. In: VLDB (2023)