The Five Generations of Entity Resolution on Web Data
=============



**Tutorial at [24th International Conference on Web Engineering (ICWE 2024)](https://icwe2024.webengineering.org)**


**Location**: Tampere, Finland

**Date**: Monday 17th of June 2024, 11:00 – 12:30 @ Auditorium A4

View and/or download slides here: [{bdg-success}`Download slides`](https://github.com/AI-team-UoA/pyJedAI/blob/main/docs/presentations/ICWE2024/5gER-Tutorial-Slides.pdf)


# Presenters


::::{grid}
:gutter: 3

:::{grid-item-card} [Konstantinos Nikoletos](https://nikoletos-k.github.io)
Research Associate at [University of Athens](https://en.uoa.gr)
:::

:::{grid-item-card} [Ekaterini Ioannou](https://www.tilburguniversity.edu/staff/ekaterini-ioannou)
Assistant Professor at [Tilburg University](https://www.tilburguniversity.edu)
{bdg-primary}`Product Matching expert`
:::

:::{grid-item-card} [George Papadakis](https://gpapadis.wordpress.com)
Senior Researcher at [University of Athens](https://en.uoa.gr){bdg-primary}`Entity Resolution expert`
:::

::::


# Abstract

Entity Resolution constitutes a core data integration task that has attracted a bulk of works on improving its effectiveness and time efficiency. This tutorial provides a comprehensive overview of the field, distinguishing relevant methods into five main generations. The first one targets Veracity in the context of structured data with a clean schema. The second generation extends its focus to cover Volume, as well, leveraging multi-core or massive parallelization to process large-scale datasets. The third generation addresses the additional challenge of Variety, targeting voluminous, noisy, semi-structured, and highly heterogeneous data from the Semantic Web. The fourth generation also tackles Velocity so as to process data collections of a continuously increasing volume. The latest works, though, belong to the fifth generation, involving pre-trained (large) language models which heavily rely on external knowledge to address all four Vs with high effectiveness.

# Programme

- **Introduction and motivation**, including ER preliminaries, fundamental assumptions, principles, and overview of the generations.
- **1st generation**, focusing on Veracity, with schema matching, blocking, entity matching, and methods using external Knowledge.
- **2nd generation**, tackling Volume and Veracity, including parallel blocking, parallel entity matching, and load balancing.
- **3rd generation**, tackling Variety, Volume and Veracity with techniques including schema clustering, block building, block processing, entity matching,
entity clustering.
- **4th generation**, tackling Velocity, Variety, Volume and Veracity with progressive and incremental ER, as well as query-driven resolution.
- **5th generation**, leveraging external knowledge with pre-trained LLMs pipelines, and crowdsourcing, LLMs.
- **Hands-on session with pyJedAI**, an open-source Python package with which we will build demos from scratch over real data.
- **Challenges and Final Remarks**, including automatic parameter configuration and future research directions.

# References

1. Altwaijry, H., et al.: Query: A framework for integrating entity resolution with query processing. PVLDB (2015)
2. Bernstein, P.A., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. PVLDB 4(11), 695–701 (2011)
3. B ̈ohm, C., et al.: LINDA: distributed web-of-data-scale entity matching. In: CIKM. pp. 2104–2108 (2012)
4. Christen, P.: Data Matching. Springer (2012)
5. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
6. Efthymiou, V., et al.: Self-configured Entity Resolution with pyJedAI. In: IEEE Big Data (2023)
7. Golshan, B., Halevy, A., Mihaila, G., Tan, W.: Data integration: After the teenage years. In: PODS. pp. 101–106 (2017)
8. Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental record linkage. PVLDB 7(9), 697–708 (2014)
9. Hassanzadeh, O., et al.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)
10. Ioannou, E., Garofalakis, M.: Query analytics over probabilistic databases with unmerged duplicates. TKDE (2015)
11. Kolb, L., Thor, A., Rahm, E.: Dedoop: Efficient deduplication with hadoop. PVLDB 5(12), 1878–1881
12. Kolb, L., Thor, A., Rahm, E.: Load balancing for mapreduce-based entity resolution. In: ICDE. pp. 618–629 (2012)
13. Lacoste-Julien, S., et al.: Sigma: simple greedy matching for aligning large knowledge bases. In: KDD. pp. 572–580 (2013)
14. Li, J., et al.: Rimom: A dynamic multistrategy ontology alignment framework.TKDE 21(8), 1218–1232 (2009)
15. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid.In: VLDB. pp. 49–58 (2001)
16. Nikoletos, K., Papadakis, G., Koubarakis, M.: pyJedAI: a lightsaber for Link Discovery. In: ISWC (2022)
17. Papadakis, G., Ioannou, E., Palpanas, T.: Entity resolution: Past, present and yet-to-come. In: EDBT. pp. 647–650 (2020)
18. Papadakis, G., Ioannou, E., Thanos, E., Palpanas, T.: The Four Generations of Entity Resolution. Synthesis Lectures on Data Management, Morgan & Claypool
Publishers (2021)
19. Stefanidis, K., Efthymiou, V., Herschel, M., Christophides, V.: Entity resolution in the web of data. In: WWW (2014)
20. Suchanek, F.M., et al.: PARIS: probabilistic alignment of relations, instances, and schema. PVLDB 5(3), 157–168 (2011)
21. Zeakis, A., Papadakis, G., Skoutas, D., Koubarakis, M.: Pre-trained Embeddings for Entity Resolution: An Experimental Analysis. In: VLDB (2023)

# Acknowledgements

<div align="center">
  <br>
 <a href="https://stelar-project.eu">
  <img align="center" src="https://stelar-project.eu/wp-content/uploads/2022/08/Logo-Stelar-1-f.png" width=180/>
 </a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 <!-- <a href="https://ec.europa.eu/info/index_en">
  <img align="left" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b7/Flag_of_Europe.svg/1200px-Flag_of_Europe.svg.png" width=140/>
 </a> -->
 <br><br>
 This work was supported by the <a href="https://research-and-innovation.ec.europa.eu/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-europe_en">Horizon Europe</a> project  <a href="https://stelar-project.eu">STELAR</a> (Grant No. 101070122).<br>
</div>
<br>
<br>