
Open-source Resources for NLP Tasks in Swedish - Ebbot Blog
The field of natural language processing (NLP) has grown at an incredible speed in the past few years. State-of-the-art tools developed by popular companies and researchers in the field, many of them open source, such as OpenAI's GPT-2/3, Explosion's spaCy or StanfordNLP, have helped save much of the time and effort spent on a variety of NLP tasks. However, for languages with comparatively few speakers worldwide (for example Finnish or Swedish), resources can be limited. The NLP team at Hello Ebbot has struggled with gathering resources and developing NLP models on our own. Knowing how difficult this is, we use this blog post to share the open-source resources (models, tools and datasets) that we found to serve Swedish specifically, the language that our colleague Ebbot mainly speaks.
Lemmatization and POS tagging
Lemmatization is the act of removing the endings of a word in order to return it to its base or dictionary form, which is usually known as the lemma. Sounds pretty cute, doesn't it! Lemmatization is considered a crucial task in pre-processing data for NLP, especially when it comes to building chatbots, because it allows the machine to understand human language more accurately. For example, the lemma of gör (verb) is göra, and plattformar (plural) is lemmatized to plattform. Another important pre-processing task is Part-of-Speech (POS) tagging. The name is self-explanatory: each word (and every other token) is assigned a part of speech, such as noun, verb or adjective.
Through GitHub, we found a reliable resource: a Python package that lets you perform both lemmatization and POS tagging for Swedish text in just a few lines of code. By wrapping pre-trained UDPipe models as a spaCy pipeline for 50+ languages, TakeLab makes it possible to perform lemmatization and POS tagging efficiently. The package is available in TakeLab's GitHub repo.
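To give an idea of what "a few lines of code" can look like, here is a minimal sketch. It assumes the package in question is spacy-udpipe (installed with pip) and that "sv" is its language code for Swedish. The helper names lemmas_and_tags and load_swedish_pipeline are ours, made up for illustration.

```python
def lemmas_and_tags(doc):
    """Return (text, lemma, POS) triples for every token in a spaCy Doc."""
    return [(token.text, token.lemma_, token.pos_) for token in doc]


def load_swedish_pipeline():
    """Download (if needed) and load the Swedish UDPipe model as a spaCy pipeline.

    Assumption: the spacy-udpipe package is installed and "sv" is the
    language code of its Swedish model.
    """
    import spacy_udpipe  # imported lazily because it is a third-party dependency

    spacy_udpipe.download("sv")
    return spacy_udpipe.load("sv")
```

With the model loaded, something like lemmas_and_tags(load_swedish_pipeline()("Vi bygger plattformar")) should give one triple per token. The exact lemmas and tags depend on the model version, so check a few sentences from your own domain before relying on it.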
NER datasets for spaCy training
spaCy is one of the most popular libraries among NLP practitioners and researchers, with pre-trained models for tagging, parsing and entity recognition supporting 15 languages at the time of writing. spaCy also lets developers train new models for unsupported languages, so Hello Ebbot's NLP team decided to try training a spaCy NER model for Swedish entities.
Finding solid datasets has always been a journey for us, but luckily we found a manually annotated Swedish corpus by a fellow NLP practitioner, Andreas Klintberg. However, the dataset was formatted for CoreNLP (.txt), while spaCy requires .json files as training data. But don't you worry, we got you covered! In this Medium post by DataTurks, you can find a script for exactly the conversion you need to train your own spaCy NER.
We also trained an NER model on this dataset, together with Swedish fastText vectors. The resulting precision was 91.6% for PER (person), 82.8% for LOC (location), 73.9% for ORG (organization) and 40.3% for MISC (miscellaneous).
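If you are curious what such a conversion involves, here is a rough sketch of the idea. This is not the DataTurks script. It assumes the CoreNLP-style file has one token and one label per line, uses "O" for non-entities, and ends each sentence with a blank line. The output layout (text plus character-offset entities) mirrors what spaCy training examples usually look like, but the exact schema depends on your spaCy version.

```python
import json


def corenlp_to_spacy(lines):
    """Convert CoreNLP-style "token<TAB>label" lines to text + entity offsets.

    Assumptions (not confirmed by the corpus itself): one token per line,
    token and label separated by whitespace, "O" marks non-entities, and a
    blank line ends a sentence.
    """
    examples, tokens = [], []
    for line in list(lines) + [""]:  # the extra "" flushes the last sentence
        parts = line.split()
        if len(parts) >= 2:
            tokens.append((parts[0], parts[-1]))
        elif not parts and tokens:
            examples.append(_build_example(tokens))
            tokens = []
    return examples


def _build_example(tokens):
    """Join tokens with spaces and merge adjacent same-label tokens into spans."""
    text, entities, prev_label = "", [], "O"
    for word, label in tokens:
        if text:
            text += " "
        start = len(text)
        text += word
        if label != "O":
            if label == prev_label:
                # No B-/I- prefixes in this format, so neighbours with the
                # same label are treated as one entity.
                entities[-1][1] = len(text)
            else:
                entities.append([start, len(text), label])
        prev_label = label
    return {"text": text, "entities": entities}


sample = ["Anna\tPER", "bor\tO", "i\tO", "Stockholm\tLOC", ""]
print(json.dumps(corenlp_to_spacy(sample), ensure_ascii=False))
# [{"text": "Anna bor i Stockholm", "entities": [[0, 4, "PER"], [11, 20, "LOC"]]}]
```

Joining tokens with single spaces is a simplification: punctuation ends up with a space in front of it, which is fine for training as long as the offsets match the text you feed in.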
Training results for the Swedish SpaCy NER
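For readers new to the metric, per-label precision is simply the share of predicted entities of a given type that are actually correct. Below is a small sketch with made-up spans, not our evaluation data. It assumes exact-match scoring, where a prediction only counts if start, end and label all agree with the annotation. The function name precision_per_label is ours.

```python
from collections import Counter


def precision_per_label(gold, predicted):
    """Precision per entity label: correct predictions / all predictions.

    `gold` and `predicted` are sets of (start, end, label) spans.
    """
    correct = Counter(label for (_, _, label) in predicted & gold)
    total = Counter(label for (_, _, label) in predicted)
    return {label: correct[label] / total[label] for label in total}


# Hypothetical spans for illustration only.
gold = {(0, 4, "PER"), (11, 20, "LOC")}
predicted = {(0, 4, "PER"), (5, 8, "LOC"), (11, 20, "LOC")}
print(precision_per_label(gold, predicted))
```

Here the single PER prediction is correct (precision 1.0), while only one of the two LOC predictions matches the annotation (precision 0.5).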
The model works very well on people, cities, popular organizations and some street names in Sweden. However, we also noticed that it fails to detect some streets, for example Fredriksdalsvägen. If you really want a spaCy NER, we recommend converting these datasets from .csv to .json to get a larger corpus. Otherwise, please move on to the next section if you think using BERT is also fine 👇
Swedish BERT models
The National Library of Sweden (KBLab) generously shared not one, but three pre-trained language models, which were trained on a whopping 15-20GB of text. Of the three, the most impressive in our opinion is bert-base-swedish-cased-ner, thanks to its remarkably high precision in matching entities. On KBLab's GitHub, you can find an NER evaluation comparing this model with SweBERT by Arbetsförmedlingen (the Swedish Public Employment Service).
Depending on the goals of your NLP task, you can either clone their GitHub repo or, for easy instantiation, use the Huggingface pipeline. However, this precision, which is much higher than spaCy's, comes at the cost of speed, because a BERT model takes far longer to train and even to produce results. That's why we mention both options, so you can choose a model based on your development needs.
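For the Huggingface route, a minimal sketch could look like the one below. It assumes the transformers package and a backend such as PyTorch are installed, and that the model is published on the Huggingface hub as KB/bert-base-swedish-cased-ner. Because BERT tokenizers split rare words into pieces marked with "##", the pipeline output may need to be glued back together. The helper merge_wordpieces is our own name for that step.

```python
def merge_wordpieces(tokens):
    """Glue "##" continuation pieces back onto the preceding token."""
    merged = []
    for token in tokens:
        if token["word"].startswith("##") and merged:
            merged[-1]["word"] += token["word"][2:]
        else:
            merged.append(dict(token))  # copy so the input is left untouched
    return merged


def load_swedish_ner():
    """Build a Huggingface NER pipeline for KBLab's Swedish BERT NER model.

    Assumption: transformers is installed and the model id below is
    available on the hub; the weights are downloaded on first use.
    """
    from transformers import pipeline  # lazy import: heavy third-party package

    name = "KB/bert-base-swedish-cased-ner"
    return pipeline("ner", model=name, tokenizer=name)
```

A typical call would be merge_wordpieces(load_swedish_ner()("Ebbot bor på Fredriksdalsvägen i Stockholm")). Expect the first call to be slow, since the model has to be downloaded and loaded into memory.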
A little announcement:
We hope that this blog post is informative and that the resources we found can help you save at least some time and effort in your NLP tasks. Just like every practitioner and researcher in the field, we would love to share our findings, and even our research and case studies. If you are an NLP enthusiast like us, please keep an eye on our blog section for new content at least once per month!