Danish Dynaword - new collection of Danish free-form text datasets

The open dataset collection sets new standards by combining transparency, continuous updates, and open licensing

The newly published Danish Dynaword offers high-quality, reproducible Danish text data from a broad range of sources, including books, legal texts, and online discussions. All datasets adhere to strict guidelines for documentation and open licensing, making them suitable for research, AI development, and commercial use.

The open dataset has been developed by Danish Foundation Models, a collaboration between the Center for Humanities Computing (representing Aarhus University), the University of Southern Denmark, the Alexandra Institute, and the University of Copenhagen.

It addresses key shortcomings in existing datasets, such as outdated content, unclear rights, and limited domain coverage. The aim of Danish Dynaword is to set new standards for the quality of publicly available Danish language datasets by establishing clear guidelines for corpus curation based on open-source principles.

This approach recognises that projects are never truly finished but can be continually refined and updated as data sources, software, and methodologies evolve. This is achieved through community-driven initiatives hosted on online platforms, where source code, processes, and documentation are openly shared to promote reproducibility and enhance transparency.

By providing a future-proof and trustworthy alternative to existing corpora, Danish Dynaword represents a significant step forward for Danish language technology.

As Center for Humanities Computing Postdoc Kenneth Enevoldsen puts it:

“Open-source data and code are essential for obtaining reliable improvements for Danish language technologies, since they allow others to leverage existing work as a foundation for continued progress. In addition, open-source data completely removes the potential risk of downstream language models to produce copyrighted content.”

The dataset is freely available on Hugging Face:
https://huggingface.co/datasets/danish-foundation-models/danish-dynaword

A preprint of the scientific article can be accessed here:
https://pure.au.dk/portal/da/publications/dynaword-from-one-shot-to-continuously-developed-datasets