Danish natural language processing (NLP) has seen remarkable improvements in recent years; however, these improvements rarely trickle down to practitioners who seek to use NLP within their field of expertise. DaCy is a framework developed by researchers at CHC that seeks to make state-of-the-art tools accessible to researchers in the humanities, social sciences, and beyond.
DaCy’s pipelines are trained and implemented in spaCy and have proved successful in language technological tasks, including named-entity recognition, part-of-speech tagging, and dependency parsing. Moreover, DaCy integrates existing state-of-the-art models for tasks such as polarity detection, emotion classification, and hate-speech analysis.
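To give a sense of how such a pipeline is typically used, here is a minimal sketch assuming the `dacy` Python package with its `dacy.load` helper and the standard spaCy `Doc`/`Token` API; the model name shown is illustrative and may differ between releases (see the documentation for the currently available pipelines).

```python
import dacy

# Download and load one of the pretrained Danish pipelines (spaCy under the hood).
# The model name below is an assumption for illustration; check the DaCy
# documentation for the names of the current releases.
nlp = dacy.load("da_dacy_medium_trf-0.2.0")

doc = nlp("Kenneth arbejder på Aarhus Universitet og bor i Danmark.")

# Named entities, part-of-speech tags, and dependency relations are exposed
# through the standard spaCy Doc/Token attributes.
for ent in doc.ents:
    print(ent.text, ent.label_)

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```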
By augmenting the DaNE test set, the developers of DaCy have conducted a series of tests for biases and robustness in Danish NLP pipelines. DaCy excels in this context and has proved especially robust to long input lengths as well as spelling variations and errors, which is particularly relevant when dealing with data derived from social media or historical sources. In contrast to DaCy, all other models display significant ethnicity biases, and Polyglot additionally contains a significant gender bias.
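As a rough illustration of the idea behind augmentation-based robustness testing, and not DaCy’s actual evaluation code, the sketch below perturbs a test sentence with random spelling-style noise before it would be passed to a pipeline; the error rate and the sentence are assumptions made for the example.

```python
import random

def add_spelling_noise(text: str, error_rate: float = 0.05, seed: int = 42) -> str:
    """Randomly swap adjacent letters to imitate simple spelling errors."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < error_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Evaluate a model on both the original and the augmented sentence and compare
# its predictions; a robust model should degrade only slightly under the noise.
original = "Hun mødte Anders på Nørrebro i går."
augmented = add_spelling_noise(original, error_rate=0.1)
print(original)
print(augmented)
```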
The DaCy project argues that data augmentation is particularly useful for obtaining more realistic and fine-grained performance estimates for languages with limited benchmark sets. In doing so, DaCy offers a broader, systematic evaluation of Danish language models and provides a solid basis for further development, both within the field of Danish NLP and for professionals who seek to use NLP within their own domain.
DaCy is developed by Kenneth Enevoldsen (Ph.D. candidate at CHC), Kristoffer Nielbo (head of CHC), and Lasse Hansen (Ph.D. candidate, Department for Culture, Cognition and Computation).
Check out the DaCy GitHub repository and documentation.