Recurrent Neural Networks (RNNs) quickly became the go-to neural network architecture for Natural Language Processing (NLP) tasks. In this blog post, I’ll start with a broad definition of their architecture, and then explain what makes them so popular with the NLP community. Finally, I’ll list a collection of blog posts, tutorials, research papers, and frequently asked questions to help you discover the different flavours of RNNs.

 

An RNN can be seen as a chain of copies of the same network. Credit: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2015

Over the course of the last few years, recurrent architectures for neural networks have established themselves as state-of-the-art in several NLP tasks, ranging from Named Entity Recognition[1] to Language Modeling[2] through Machine Translation[3].
This successful breakthrough comes long after the first proposal of this kind of architecture, around 30 years ago[4] (20 years ago for modern architectures[5]).

The main advantage of RNNs resides in their ability to deal with sequential data, thanks to their “memory”. Whereas Artificial Neural Networks (ANNs) have no notion of time, and the only input they consider is the current example they are being fed, RNNs consider both the current input and a “context unit” built upon what they’ve seen previously.

So the prediction made by the network at timestep T is influenced by the one it made at timestep T – 1. And when you think about it, that’s pretty much what we do as humans: we use our previous experience (T – 1) to handle new and unseen situations (T).
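This recurrence is compact enough to sketch in a few lines. The snippet below is a minimal, illustrative vanilla RNN cell in numpy, with made-up dimensions and randomly initialised weights (in a real system these would be learned): the hidden state plays the role of the “context unit”, carrying information from one timestep to the next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dimensional inputs, 3-dimensional hidden state.
input_dim, hidden_dim = 4, 3

# Randomly initialised weights (learned by backpropagation in practice).
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (the "memory")
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One timestep: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# A toy sequence of 5 timesteps.
sequence = rng.normal(size=(5, input_dim))

h = np.zeros(hidden_dim)  # empty "memory" before the first input
for x_t in sequence:
    h = rnn_step(x_t, h)  # the state at T depends on the state at T - 1
```

The key line is `W_hh @ h_prev`: a plain feed-forward network would drop that term and see each input in isolation.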
Christopher Olah puts it very nicely in his blog post[6]:

"As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again."
— Christopher Olah

An early schema of a recurrent unit, notice the context units. Credits: Jeffrey Elman, Finding structure in time, 1990

 

And luckily for us, NLP is full of sequential (or temporal) data. Be it sentences, words, or characters, we always use the context to establish a more precise meaning for communication, whether it is written or oral.

Here are a few examples:
In Machine Translation, a word will carry different meanings based on the context. Sentiment Analysis will detect modifiers (like “very”, “not”, and “a bit too”) to grasp the intensity, polarity or negation of a sentiment. In Dialog Management, the next step of a conversation is conditioned by the previous interactions and the goal given to the system. For Tokenization, we can use the next and previous characters to say whether or not a new word is beginning.
And it doesn’t stop there: Part of Speech Tagging, Sentence Segmentation, Language Modeling, Semantic Role Labelling, Text Summarization, Spell Checking, and a whole lot of other tasks rely on the sequential nature of the data.

Google’s Neural Machine Translation uses a deep LSTM architecture with 8 encoder and 8 decoder layers, using both attention and residual connections. Credit: Quoc Le, Mike Schuster, https://research.googleblog.com/2016/09/a-neural-network-for-machine.html, 2016
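The system in that figure is production-scale, but one of the tricks it names, residual connections between stacked recurrent layers, is easy to sketch. The toy code below substitutes plain tanh RNN layers for LSTMs and uses invented dimensions; the point is only the `+ h` at the end, which adds each layer’s input back to its output so gradients can flow through deep stacks.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8  # hypothetical: all layers share one width so residuals add directly

def make_layer():
    """Random weights for one simple recurrent layer (learned in practice)."""
    return (rng.normal(scale=0.1, size=(dim, dim)),   # input -> hidden
            rng.normal(scale=0.1, size=(dim, dim)))   # hidden -> hidden

def run_layer(weights, xs):
    """Run one recurrent layer over a whole sequence, returning every hidden state."""
    W_xh, W_hh = weights
    h_t = np.zeros(dim)
    states = []
    for x in xs:
        h_t = np.tanh(W_xh @ x + W_hh @ h_t)
        states.append(h_t)
    return np.stack(states)

layers = [make_layer() for _ in range(4)]  # a small stack (GNMT uses 8)
xs = rng.normal(size=(6, dim))             # toy sequence of 6 timesteps

h = xs
for layer in layers:
    h = run_layer(layer, h) + h  # residual connection: layer output + layer input
```

Without the residual term, each extra layer makes the gradient path longer and training deep recurrent stacks noticeably harder.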

 

But RNNs are not perfect: because each timestep’s computation needs the result of the previous one, they are hard to parallelize, slow to train, and computationally expensive. Today, more and more researchers are turning to Convolutional Neural Networks (CNNs), which offer speed and accuracy improvements on many tasks.

Still, the adage “an LSTM with an attention layer will yield state-of-the-art results on any task” was earned for a reason, and recurrent architectures will keep populating user-facing NLP systems and benchmark baselines for a long time.
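The attention layer in that adage is itself small. Below is a minimal sketch of dot-product attention over a sequence of recurrent hidden states, with made-up dimensions and random vectors standing in for real encoder and decoder states: the decoder state scores every encoder state, the scores are turned into a softmax distribution, and the result is a weighted average the decoder can condition on.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8

# Hypothetical setup: one hidden state per source token from a recurrent encoder,
# and the decoder's current hidden state acting as the attention query.
encoder_states = rng.normal(size=(6, dim))  # 6 source tokens
decoder_state = rng.normal(size=dim)

# Score each encoder state against the query (dot product).
scores = encoder_states @ decoder_state

# Softmax the scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: a weighted average of the encoder states.
context = weights @ encoder_states
```

At each decoding step the weights change, so the decoder can “look back” at different source tokens instead of squeezing the whole sentence into one fixed vector.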

Blogs

Introductions

Studies

Tutorials

Research

Surveys

Theses

Papers

FAQ

How are recurrent neural networks different from convolutional neural networks?

What is the difference between Recurrent Neural Networks and Recursive Neural Networks?

What is the difference between LSTM and GRU for RNNs?

How to select the number of hidden layers/hidden units in an LSTM?

What’s so great about LSTMs?

What is masking in a Recurrent Neural Network?

What is the attention mechanism introduced in RNNs?

Is LSTM Turing complete?

When should one decide to use an LSTM in a Neural Network?

Why doesn’t the LSTM forget gate cause a vanishing/dying gradient?

What is the difference between states and outputs in an LSTM?

Is it possible to do online learning with LSTMs?


Also published on Medium.

References

1. Zhiheng Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging, 2015
2. Stephen Merity et al., Regularizing and Optimizing LSTM Language Models, 2017
3. Yonghui Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016
4. John Hopfield, Neural networks and physical systems with emergent collective computational abilities, 1982
5. Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, 1997
6. Christopher Olah, Understanding LSTMs, 2015

