The newly published Danish Dynaword offers high-quality, reproducible Danish
text data from a broad range of sources, including books, legal texts, and online
discussions. All datasets adhere to strict guidelines for documentation and open
licensing, making them suitable for research, AI development, and commercial
use.
The open dataset has been developed by Danish Foundation Models, a
collaboration between the Center for Humanities Computing (representing Aarhus
University), the University of Southern Denmark, the Alexandra Institute, and the
University of Copenhagen. It addresses key shortcomings in existing datasets,
such as outdated content, unclear rights, and limited domain coverage.
The aim of Danish Dynaword is to set new standards for the quality of publicly
available Danish language datasets by establishing clear guidelines for corpus
curation based on open-source principles. This approach recognises that projects
are never truly finished but can be continually refined and updated as data
sources, software, and methodologies evolve. Such refinement is achieved through
community-driven initiatives hosted on online platforms, where source code,
processes, and documentation are openly shared to promote reproducibility and
enhance transparency.
By providing a future-proof and trustworthy alternative to existing corpora,
Danish Dynaword represents a significant step forward for Danish language
technology. As Postdoc Kenneth Enevoldsen puts it:
“Open-source data and code are essential for obtaining reliable improvements for
Danish language technologies, since they allow others to leverage existing work as
a foundation for continued progress. In addition, open-source data completely
removes the potential risk of downstream language models to produce copyrighted
content.”
A preprint of the scientific article is available online, along with more
information about the work of Danish Foundation Models.