Introducing DaCy 2.7.0: Enhanced Models and Exciting New Features

We're thrilled to announce the release of DaCy 2.7.0, packed with exciting updates and powerful additions to our natural language processing (NLP) package. With this release, we've made significant improvements across various tasks, expanded the range of supported features, and set the stage for even more advancements in the near future. Let's dive into the details of what's new in DaCy 2.7.0.

Updated DaCy Models and Training Data

One of the highlights of this release is the update to our DaCy models (small, medium, large), now at version 0.2.0. Trained on the intersection of multiple Danish datasets (CDT, DDT, DaCoref, DaNE, DaNED), these models generally perform better than their predecessors even though they are trained on less data! We additionally added document annotations from the CDT to the other datasets to improve the processing of long-range dependencies.
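Trying out the updated models works the same way as before. Here's a minimal sketch; the exact model name is an assumption based on the existing `da_dacy_{size}_trf-{version}` naming convention, so check the documentation for the names shipped with this release:

```python
import dacy

# Load one of the updated v0.2.0 models (name assumed to follow the
# da_dacy_{size}_trf-{version} convention).
nlp = dacy.load("da_dacy_medium_trf-0.2.0")

doc = nlp("DaCy er en dansk NLP-pakke til spaCy.")
print([(token.text, token.pos_, token.lemma_) for token in doc])
```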

Beta Support for Coreference Resolution and Named Entity Linking

We're excited to unveil beta support for coreference resolution and named entity linking in DaCy 2.7.0.
Coreference resolution allows you to identify and connect expressions that refer to the same entity, enhancing context understanding in your NLP applications. 
Additionally, named entity linking enables you to link entities in text to their corresponding entries in a knowledge base, opening up new possibilities for entity enrichment. 
While both features are in beta, they represent significant steps forward in expanding DaCy's capabilities.
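To give a feel for the two features, here is a hedged sketch of inspecting their output, assuming coreference clusters are exposed through spaCy's standard `doc.spans` container and entity links through the standard `ent.kb_id_` attribute (the model name and span-group keys are assumptions):

```python
import dacy

# Model name is an assumption; see the loading example above.
nlp = dacy.load("da_dacy_medium_trf-0.2.0")

doc = nlp("Anna købte en bog. Hun læste den i toget.")

# Coreference: spaCy components conventionally store clusters as span
# groups in doc.spans; the exact keys depend on the component.
for key, spans in doc.spans.items():
    if "coref" in key:
        print(key, [span.text for span in spans])

# Named entity linking: linked entities carry a knowledge-base ID
# (e.g. a Wikidata QID) via spaCy's standard kb_id_ attribute.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
```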

Enhanced Dependency Parsing, Part-of-Speech Tagging, and Lemmatization

DaCy 2.7.0 incorporates the latest version of the DDT treebank, enabling improved dependency parsing and part-of-speech tagging. By leveraging the updated treebank, we've fine-tuned these tasks to achieve state-of-the-art performance, surpassing our previous best models.
Moreover, we've adopted spaCy's new trainable lemmatizer, which substantially improves lemmatization accuracy. You'll get more accurate and reliable linguistic analysis across the board.
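All of these analyses are available through the usual spaCy token attributes. A quick sketch (the model name is again an assumption):

```python
import dacy

nlp = dacy.load("da_dacy_small_trf-0.2.0")  # model name is an assumption

doc = nlp("Hunden løb hurtigt gennem parken.")
for token in doc:
    # Surface form, part-of-speech tag, dependency relation, head, lemma.
    print(token.text, token.pos_, token.dep_, token.head.text, token.lemma_)
```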

Performance Breakdown for Large, Medium, and Small Models

Our large model, a true powerhouse, has achieved state-of-the-art performance in dependency parsing, part-of-speech tagging, morphological tagging, and lemmatization. Lemmatization accuracy in particular has leapt from 84.91 to an impressive 95.89. However, the performance of named entity recognition (NER) has slightly decreased to 87.38. To mitigate this, we recommend integrating the SotA ScandiNER model using `nlp.add_pipe("dacy/ner")` or opting for one of the new fine-grained NER models added in DaCy 2.6.0.
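As a sketch of the ScandiNER option, you can run the `dacy/ner` component on its own pipeline; we're assuming here that importing `dacy` registers the factory with spaCy and that running it standalone avoids clashes with a loaded model's built-in NER:

```python
import spacy
import dacy  # noqa: F401  (assumed to register the "dacy/ner" factory)

# A minimal sketch: run the SotA ScandiNER component on a blank Danish
# pipeline so it does not conflict with an existing model's NER head.
nlp = spacy.blank("da")
nlp.add_pipe("dacy/ner")

doc = nlp("Mette Frederiksen besøgte København i maj.")
print([(ent.text, ent.label_) for ent in doc.ents])
```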
The medium model showcases consistent improvements across all tasks. Notably, the F1 score for NER has risen from 81.79 to 85.82, and lemmatization accuracy has increased from 84.91 to 94.
Similarly, the small model exhibits consistent enhancements across all tasks. This compact model is perfect for projects with limited computational resources, without compromising performance too much.

Fixes and Removals

We have resolved several issues and made the overall user experience smoother in this release. Notably, we removed custom requirements for the large model and eliminated warnings during model loading. Additionally, we fixed annotation errors in the DDT treebank where "'" was not followed by a space. 
As part of our forward progress, we have removed support for DaCy model version 0.1.0.

What's Next?

The DaCy 2.7.0 release marks another stepping stone towards our vision of providing a powerful and versatile NLP toolchain for Danish. Looking ahead, we have exciting plans in store:

1. Coreference Resolution Only Model: A dedicated model focused solely on coreference resolution to further enhance its performance and capabilities.

2. Improved Named Entity Linking: The current entity linker has limitations stemming from the original annotations of DaNED, which, for example, annotate person entities with the QID of the name rather than of the person itself. We believe this is not the ideal behavior, and we are working on improving it.

3. Enhanced Knowledge Base: The entity linker is currently constrained by its limited knowledge base. We are working on expanding it to cover more entities and improve overall performance, potentially incorporating official Danish registries as well as Wikipedia.

4. Model Generalization: We are exploring ways to improve model generalization using the upcoming Danish language resource, DANSK. By leveraging this valuable asset, we aim to enhance the overall performance and generalization of our models.

These are just a few of the exciting directions we're pursuing, with many more advancements on the horizon.

Upgrade to DaCy 2.7.0 today and experience the enhanced models and exciting new features. We're committed to continuously improving DaCy to empower you with state-of-the-art NLP capabilities. Stay tuned for more updates, and happy coding with DaCy!