11 / 19 / 2021

The Twitter Crawler

Social media platforms such as Twitter now act as catalysts for the diffusion of news items and as a means of personal opinion expression. Studying such mass behavior therefore seems crucial, but it first requires collecting the relevant data efficiently and effectively.

Twitter's official policy on sharing its data is generally open when it comes to users' public posts. Twitter therefore offers multiple Application Programming Interfaces (APIs), accessible at various pricing tiers, to individual researchers, institutions and hobbyists. CERTH has been developing specialized tools, called wrappers, that work with the APIs Twitter currently offers for free, albeit with some restrictions. It should be clarified at this point that Twitter has a strict policy on copyright infringement, removing any content that violates the relevant legislation. As a second tier of copyright protection, Twitter holds each of its users responsible for checking whether the content they post is under copyright protection. Consequently, an individual who uses a crawler to retrieve public posts will not, under any circumstances, be held responsible for copyright violations.
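The text does not specify which endpoints the CERTH wrappers use, so the following is only a minimal sketch of such a wrapper, built on Twitter's free filtered-stream endpoint through the tweepy library; the class name, the example rule and the TWITTER_BEARER_TOKEN environment variable are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch of a keyword-based wrapper around Twitter's
# filtered-stream endpoint, using tweepy (hypothetical; the actual
# CERTH wrappers may differ).
import os

import tweepy


class KeywordStream(tweepy.StreamingClient):
    """Receives public tweets matching previously registered rules."""

    def on_tweet(self, tweet):
        # Each matching public post arrives here in real time.
        print(tweet.id, tweet.text[:80])


if __name__ == "__main__":
    # TWITTER_BEARER_TOKEN is an assumed environment variable.
    stream = KeywordStream(os.environ["TWITTER_BEARER_TOKEN"])
    # Register one rule per keyword/phrase of interest.
    stream.add_rules(tweepy.StreamRule('"refugee stories" lang:en'))
    stream.filter()  # blocks, delivering matching tweets to on_tweet
```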

As a tool, the Twitter crawler is fully automated at the moment; however, a user must feed it specific input before it can start retrieving public online data. For the SO-Close project, our main concern was that this input should initially consist of various targeted collections of keywords. This may sound like an easy task; in practice, the user group had to convene to investigate and suggest the most project-pertinent groups of keywords, phrases and sentences, resulting in 14 collections and over 200 key phrases. Each of the 14 collections was added to the configuration files of the Twitter crawler, which then ran as an online service over a long period, fetching public tweet posts and storing them in a MongoDB database. The initial results of the first testing phase were very promising: 291,929 public tweet posts were amassed, yielding a mainly textual dataset of approximately 1.8 GB on a hard disk drive.
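Neither the configuration format nor the database schema is described here, so the sketch below only illustrates how keyword collections could be declared and how harvested tweets could be persisted in MongoDB using pymongo; the database name, collection names, keywords and document layout are all hypothetical.

```python
# Illustrative sketch only: keyword collections as configuration, plus
# persistence of harvested tweets in MongoDB via pymongo. All names
# below are assumptions, not the project's actual setup.
from pymongo import MongoClient

# Stand-in for the crawler's configuration files: two invented keyword
# collections, each a list of key phrases (the project used 14).
KEYWORD_COLLECTIONS = {
    "displacement": ["forced migration", "refugee camp"],
    "memory": ["oral history", "collective memory"],
}

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
tweets = client["so_close"]["tweets"]              # assumed names


def store_tweet(collection_name: str, tweet_id: int, text: str) -> None:
    """Persist one harvested tweet, tagged with its keyword collection."""
    tweets.update_one(
        {"_id": tweet_id},  # reusing the tweet ID avoids duplicates
        {"$set": {"collection": collection_name, "text": text}},
        upsert=True,
    )
```

Keying each document on the tweet ID with an upsert keeps the store idempotent, so a post that matches several key phrases, or is fetched twice by the long-running service, is stored only once.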

At this point, the necessity of the Twitter crawler, and of crawlers in general, within the scope of the SO-Close project becomes even clearer. The actual purpose of such a tool is to retrieve scarce original material from users who are far away and cut off, which in turn enables plurality and diversity in the creation of immersive experiences for the Memory Center Platform. After all, it is one more tool in the specialists' quiver to bring us from so far to so close.