04 / 27 / 2022

Easie as Easy Information Extraction

Hello from the CERTH Team. Today we would like to talk to you about easIE 🙂

A user's experience of visiting a website rests on a great deal of software development, both in the front end and in the back end. The former is responsible for how the website appears to the user, whereas the latter silently implements the functions that must run when certain events occur. Websites also contain valuable pieces of information that, in some cases, may be retrieved and reused by an interested third party.

Information retrieval from the web, and from websites in particular, has been a hot research topic since the beginning of the digital era. Various tools have been developed over the years: some scrape entire websites, noise included, while others perform focused crawling and target only specific parts of a web page. CERTH has developed, and continues to extend, a tool of the second kind: easIE. A data-aggregation enthusiast should also keep in mind that not every site allows its content to be retrieved and reused, because of copyright restrictions. Others permit it partially, sometimes even providing application programming interfaces or guidelines on how to retrieve content ethically, so that crawling does not deplete the server’s resources and affect its availability.
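One widespread convention through which sites publish such crawling guidelines is the robots.txt file. We do not claim here that easIE consults it; the following is only a minimal sketch, using Python's standard urllib.robotparser module, of how a crawler could check whether a URL may be fetched and how long it should wait between requests. The robots.txt content, the user agent name and the example.org URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download it
# from the target site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-crawler"  # hypothetical user agent name


def build_policy(robots_txt):
    """Parse robots.txt text into a policy object that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)

# Ask the policy before fetching anything (hypothetical URLs).
allowed = policy.can_fetch(USER_AGENT, "https://example.org/articles/1")
blocked = policy.can_fetch(USER_AGENT, "https://example.org/private/report")

# Seconds the site asks crawlers to wait between requests, or None if unset.
delay = policy.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

A polite crawler would skip any URL for which can_fetch returns False and sleep for the requested delay between consecutive requests.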

easIE is an acronym for easy Information Extraction. It is an easy-to-use information extraction framework that initially extracted data about companies from heterogeneous Web sources in a semi-automatic manner. Since then, it has been extended to cover various types of sources and content. Admin users only need to define a configuration file, from which the framework quickly and simply generates Web Information Extractors and Wrappers. It offers a set of wrappers for obtaining content from static and dynamic HTML pages by pointing to the HTML elements with CSS selectors. As a consequence, the website to be crawled must follow a semantically well-defined structure for the framework to reach its full potential.
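To make the idea of configuration-driven extraction more concrete, here is a minimal sketch in Python. It is not easIE's code and does not follow easIE's actual configuration schema: the CONFIG layout, the field names, the sample HTML and the helper names are all assumptions made for illustration. It uses only the standard library, handles static HTML only, and supports just a tiny subset of CSS selectors (a tag name, an #id and .class names combined, with no combinators).

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def parse_selector(selector):
    """Split a simple selector such as 'h1.name' into (tag, id, classes)."""
    tag = re.match(r"[A-Za-z][\w-]*", selector)
    ident = re.search(r"#([\w-]+)", selector)
    classes = set(re.findall(r"\.([\w-]+)", selector))
    return (tag.group(0).lower() if tag else None,
            ident.group(1) if ident else None,
            classes)


class SelectorExtractor(HTMLParser):
    """Collects the text of every element matching one simple selector."""

    def __init__(self, selector):
        super().__init__()
        self.tag, self.ident, self.classes = parse_selector(selector)
        self.results = []   # text of each matched element
        self._stack = []    # open elements as (tag, is_match) pairs
        self._buffers = []  # one text buffer per open matched element

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        if self.tag and tag != self.tag:
            return False
        if self.ident and attrs.get("id") != self.ident:
            return False
        element_classes = set((attrs.get("class") or "").split())
        return self.classes <= element_classes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        matched = self._matches(tag, attrs)
        self._stack.append((tag, matched))
        if matched:
            self._buffers.append([])

    def handle_data(self, data):
        for buffer in self._buffers:
            buffer.append(data)

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _ in self._stack]:
            return  # stray closing tag: ignore it
        while self._stack:
            open_tag, matched = self._stack.pop()
            if matched:
                text = "".join(self._buffers.pop())
                self.results.append(" ".join(text.split()))
            if open_tag == tag:
                break


def extract(html, config):
    """Apply every selector in the configuration to the HTML document."""
    record = {}
    for field, selector in config["fields"].items():
        parser = SelectorExtractor(selector)
        parser.feed(html)
        parser.close()
        record[field] = parser.results
    return record


# Hypothetical configuration: output field names mapped to CSS selectors.
CONFIG = {"fields": {"name": "h1.name", "country": "#country", "tags": "li.tag"}}

# Hypothetical page with a well-defined structure.
SAMPLE_HTML = """
<html><body>
  <h1 class="name">Example Org</h1>
  <span id="country">Greece</span>
  <ul>
    <li class="tag">culture</li>
    <li class="tag featured">heritage</li>
  </ul>
</body></html>
"""

record = extract(SAMPLE_HTML, CONFIG)
print(record)
```

In this sketch, changing what gets extracted means editing CONFIG only, not the code. It also shows why a well-defined page structure matters: if the elements of interest carried no distinguishing tag, id or class, no selector could single them out.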

CERTH has been developing and testing easIE throughout the life cycle of the SO-CLOSE project in order to satisfy the requirements for legally retrieving targeted content from official repositories and websites, such as https://www.amnesty.org/en/, https://www.digitalmeetsculture.net and http://cultural-opposition.eu. Every element retrieved so far resides in a local MongoDB database instance and in the knowledge graph of the project, ready to be included in the process of creating immersive experiences.