Developing new text data collections for specific research questions
Crawling textual data sources from the Internet (compliant with GDPR)
- collecting corpora from social media platforms such as Twitter & Facebook
- crawling threads from reddit.com etc.
Data conversion and extraction from various file formats
- unstructured data: .pdf / .doc / .txt etc.
- structured data: XML etc.
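
As an illustration of extraction from structured data, the following minimal sketch pulls the plain text out of XML elements using Python's standard library. The function name `extract_text`, the tag names, and the sample document are assumptions made for this example only; real corpora will have their own schemas.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_string, tag):
    """Return the stripped text of every <tag> element, including nested inline markup."""
    root = ET.fromstring(xml_string)
    # itertext() walks the element and its children, so inline tags such as <b> are flattened.
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]


# Hypothetical sample document, for illustration only.
sample = (
    "<corpus><doc id='1'>"
    "<p>First text.</p>"
    "<p>Second <b>bold</b> text.</p>"
    "</doc></corpus>"
)

paragraphs = extract_text(sample, "p")
print(paragraphs)
```

Unstructured formats such as .pdf or .doc typically need format-specific third-party tools and are not covered by this sketch.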
Text data preprocessing
- tokenization
- automatic linguistic analysis: e.g. Named Entity Recognition (NER)
- automated cleanup: spelling normalization & correction
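
The first and last preprocessing steps above can be sketched with the standard library alone. This is a simplified illustration, not a production pipeline: the regular expression `TOKEN_RE` and the helpers `normalize` and `tokenize` are assumptions for this example, and the cleanup here covers only character and whitespace normalization. Spelling correction and Named Entity Recognition usually rely on dedicated NLP libraries and are left out.

```python
import re
import unicodedata

# A word, optionally joined by apostrophes or hyphens, or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def normalize(text):
    """Apply Unicode NFKC normalization, drop zero-width spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u200b", "")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text into word and punctuation tokens."""
    return TOKEN_RE.findall(normalize(text))


tokens = tokenize("Don't  panic, it's state-of-the-art!")
print(tokens)
```

Keeping contractions and hyphenated compounds as single tokens is a design choice of this sketch; other tokenization schemes split them.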