Teven Le Scao (Hugging Face)

Talk title: BigScience: Collaboratively training a large multilingual language model


Large (> 100B parameters) language models have been the subject of intense research, engineering, and speculation since OpenAI released GPT-3. Although several groups have trained or are currently training such models, all of the publicly known trained models are monolingual autoregressive models like GPT-3, and all but one are closed-source. In this talk, I'll walk the audience through the process of building a collaborative, open-access 176B-parameter language model through our experience with BigScience.

Sanchit Gandhi (Hugging Face)

Tutorial title: Hands-on NLP with Hugging Face and Gradio


This tutorial will take a deep dive into the Hugging Face ecosystem. We will explore the collection of over 50,000 models and 6,000 datasets shared on the Hugging Face Hub, covering efficient ways to find the right models and datasets for any given task. We will learn how to load and use the most popular ML models and datasets in just two lines of code and see how to use them in custom ML pipelines. We will conclude by building a demo to showcase an ML model to the community.

Aline Villavicencio (Sheffield)

Talk title: Is the bullet silver for the computational modelling of Multiword Expressions or are we still biting it?


Recent advances in Natural Language Processing (NLP) and Machine Learning (ML) have contributed to the development of very large-scale models of languages, in particular of word representations, that have led to substantial improvements in downstream tasks like machine translation, information retrieval and text simplification. Indeed, contextualised word representation models have been successfully used for capturing distinct (and very specific) word usages, and therefore could provide an attractive alternative for accurately determining meaning in language. However, these models still face a serious challenge when dealing with non-literal language, like that involved in Multiword Expressions (MWEs) such as idioms (make ends meet), light verb constructions (give a sigh), verb particle constructions (shake up) and noun compounds (loan shark). MWEs are an integral part of the mental lexicon of native speakers, often used to express complex ideas in a simple and conventionalised way accepted by a given linguistic community. Although they may display a wealth of idiosyncrasies, from lexical, syntactic and semantic to statistical, that represent a real challenge for current NLP techniques, their accurate integration has the potential to improve the precision, naturalness and fluency of downstream tasks like machine translation. In this talk, I will present an overview of advances in word representations and their use for the identification and modelling of MWEs. I will concentrate on techniques for identifying their degree of idiomaticity and approximating their meaning, as their interpretation often needs more knowledge than can be gathered from their individual components and their combinations. Such techniques aim to differentiate combinations whose meaning can be (partly) inferred from their parts (such as apple juice: juice made of apples) from those whose meaning cannot (such as dark horse: an unknown candidate who unexpectedly succeeds).

Elena Gribovskaya (DeepMind)

Talk title: Keeping language models in sync with the real world


Many recent successes in language models (LMs) have been achieved within a ‘static paradigm’, where the focus is on improving performance on benchmarks that are created without considering the temporal aspect of data. Examples include answering questions about events that the model could learn about during training, or evaluating on text sub-sampled from the same period as the training data. However, our language and knowledge are dynamic and ever evolving. Therefore, to enable a more realistic evaluation of question-answering models for the next leap in performance, it’s essential to ensure they are flexible and robust when encountering new and unseen data.

In this talk, we will describe our experiments and results on taking current state-of-the-art models and placing them in a realistic scenario of making predictions from beyond the models' training period. We will talk about new streaming language modeling and question answering benchmarks created for this purpose and show how current Transformer-based models perform worse in this setup. We will then present and contrast different ways to keep our models in sync with the world as new data arrive, either by continually updating the (monolithic) models’ parameters or by leveraging semi-parametric approaches that flexibly store and use knowledge in a modular way. Finally, towards more open-ended models that can remain in sync with the ever-changing world, we will introduce a new family of models, internet-augmented models, that leverage the power of commercial search engines as a source of factual and up-to-date knowledge.

Xuefei Wen (AMPLYFI)

Talk title: Unlocking the value of free text for better decision-making


The web provides a wealth of text-based data in the form of news articles, academic papers or patents. Taking advantage of this information is crucial for better decision-making; however, acquiring, processing and making sense of such data is challenging due to its unstructured nature and the different personas that consume it. In this talk, we'll introduce some of AMPLYFI's techniques for extracting data and deriving insights from web data in the context of two use cases: (1) Sales Enablement: for helping salespeople sell more and faster; and (2) Technology Intelligence: for improving decision-making around technology investment. We will also discuss some of the challenges we have faced along the way, and our way forward.

Isabel Groves (Amazon)

Talk title: NLP and Knowledge Graphs research for Amazon Alexa’s question answering

Ignacio Iacobacci (Huawei Noah's Ark Lab)

Talk title: Pretrained Language Models for Code Understanding and Generation


In the last few years there has been tremendous growth in the topic of code understanding and generation using NLP-grounded deep learning models. While earlier approaches were able to deal with just the simplest tasks, the recent application of Pretrained Language Models (PLMs), specifically trained with code snippets, has brought new capabilities, especially for the task of text-to-code generation or program synthesis. This talk will discuss the reasons for the recent growth of interest in this topic. We will discuss the main differences between working with natural languages and programming languages. We will provide an overview of the latest approaches, their intended use and limitations. We will introduce OpenAI Codex, among other models, which constitutes the building block of GitHub Copilot, a tool for code recommendation and auto-completion. We will cover existing datasets and benchmarks that are useful for making a fair comparison among different approaches. We will mention CodeXGLUE, a benchmark that covers the most common code-oriented tasks, and HumanEval, which is, at this time, the de facto benchmark for text-to-code generation. Finally, we will show some applications in the real world and future perspectives in the area.

Christopher Bryant (University of Cambridge / Reverso)

Talk title: An Introduction to Automatic Grammatical Error Correction


Grammatical Error Correction (GEC) is the task of automatically detecting and correcting all kinds of errors in text. The field has grown significantly in the past decade and now enjoys increased visibility in products such as Microsoft Word, Google Docs and Grammarly. In this talk, I will provide an overview of the field and introduce the datasets, approaches, and evaluation methods that are commonly used to build GEC systems. I will conclude with recent trends and remaining challenges for future work.