The Five Generations of Entity Resolution on Web Data#

Tutorial at 24th International Conference on Web Engineering (ICWE 2024)

Location: Tampere, Finland

Date: Monday 17th of June 2024, 11:00 – 12:30 @ Auditorium A4

View and/or download slides here: Download slides

Presenters#

Assistant Professor at Tilburg University Product Matching expert

Senior Researcher at University of AthensEntity Resolution expert

Abstract#

Entity Resolution constitutes a core data integration task that has attracted a bulk of works on improving its effectiveness and time efficiency. This tutorial provides a comprehensive overview of the field, distinguishing relevant methods into five main generations. The first one targets Veracity in the context of structured data with a clean schema. The second generation extends its focus to cover Volume, as well, leveraging multi-core or massive parallelization to process large-scale datasets. The third generation addresses the additional challenge of Variety, targeting voluminous, noisy, semi-structured, and highly heterogeneous data from the Semantic Web. The fourth generation also tackles Velocity so as to process data collections of a continuously increasing volume. The latest works, though, belong to the fifth generation, involving pre-trained (large) language models which heavily rely on external knowledge to address all four Vs with high effectiveness.

Programme#

  • Introduction and motivation, including ER preliminaries, fundamental assumptions, principles, and overview of the generations.

  • 1st generation, focusing on Veracity, with schema matching, blocking, entity matching, and methods using external Knowledge.

  • 2nd generation, tackling Volume and Veracity, including parallel blocking, parallel entity matching, and load balancing.

  • 3rd generation, tackling Variety, Volume and Veracity with techniques including schema clustering, block building, block processing, entity matching, entity clustering.

  • 4th generation, tackling Velocity, Variety, Volume and Veracity with progressive and incremental ER, as well as query-driven resolution.

  • 5th generation, leveraging external knowledge with pre-trained LLMs pipelines, and crowdsourcing, LLMs.

  • Hands-on session with pyJedAI, an open-source Python package with which we will build demos from scratch over real data.

  • Challenges and Final Remarks, including automatic parameter configuration and future research directions.

References#

  1. Altwaijry, H., et al.: Query: A framework for integrating entity resolution with query processing. PVLDB (2015)

  2. Bernstein, P.A., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. PVLDB 4(11), 695–701 (2011)

  3. B ̈ohm, C., et al.: LINDA: distributed web-of-data-scale entity matching. In: CIKM. pp. 2104–2108 (2012)

  4. Christen, P.: Data Matching. Springer (2012)

  5. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  6. Efthymiou, V., et al.: Self-configured Entity Resolution with pyJedAI. In: IEEE Big Data (2023)

  7. Golshan, B., Halevy, A., Mihaila, G., Tan, W.: Data integration: After the teenage years. In: PODS. pp. 101–106 (2017)

  8. Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental record linkage. PVLDB 7(9), 697–708 (2014)

  9. Hassanzadeh, O., et al.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)

  10. Ioannou, E., Garofalakis, M.: Query analytics over probabilistic databases with unmerged duplicates. TKDE (2015)

  11. Kolb, L., Thor, A., Rahm, E.: Dedoop: Efficient deduplication with hadoop. PVLDB 5(12), 1878–1881

  12. Kolb, L., Thor, A., Rahm, E.: Load balancing for mapreduce-based entity resolution. In: ICDE. pp. 618–629 (2012)

  13. Lacoste-Julien, S., et al.: Sigma: simple greedy matching for aligning large knowledge bases. In: KDD. pp. 572–580 (2013)

  14. Li, J., et al.: Rimom: A dynamic multistrategy ontology alignment framework.TKDE 21(8), 1218–1232 (2009)

  15. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid.In: VLDB. pp. 49–58 (2001)

  16. Nikoletos, K., Papadakis, G., Koubarakis, M.: pyJedAI: a lightsaber for Link Discovery. In: ISWC (2022)

  17. Papadakis, G., Ioannou, E., Palpanas, T.: Entity resolution: Past, present and yet-to-come. In: EDBT. pp. 647–650 (2020)

  18. Papadakis, G., Ioannou, E., Thanos, E., Palpanas, T.: The Four Generations of Entity Resolution. Synthesis Lectures on Data Management, Morgan & Claypool Publishers (2021)

  19. Stefanidis, K., Efthymiou, V., Herschel, M., Christophides, V.: Entity resolution in the web of data. In: WWW (2014)

  20. Suchanek, F.M., et al.: PARIS: probabilistic alignment of relations, instances, and schema. PVLDB 5(3), 157–168 (2011)

  21. Zeakis, A., Papadakis, G., Skoutas, D., Koubarakis, M.: Pre-trained Embeddings for Entity Resolution: An Experimental Analysis. In: VLDB (2023)

Acknowledgements#


       

This work was supported by the Horizon Europe project STELAR (Grant No. 101070122).