Which *BERT? A Survey Organizing Contextualized Encoders
A short explanation of a long paper
- Peters et al. (2018, ELMo) won the NAACL best paper award for creating strong-performing, task-agnostic sentence representations via large-scale unsupervised pretraining.
- Days later, its performance was surpassed by Radford et al. (2018), which offered representations beyond a single sentence and the flexibility of fine-tuning.
- The result is intense competition among models: in no time, a new model arrives and sets a new benchmark.
- This raises two questions:
What, besides state-of-the-art results, does the newest paper contribute? And which encoder should we use?
Goals of this survey paper
- Outline the areas of progress, relate contributions in text encoders to ideas from other fields
- Help practitioners and researchers decide which encoder to choose.
- The survey does not aim to compare specific model metrics.
- Instead, it focuses on the ideas and progress in the scientific discourse around text representations, distinguishing how approaches differ.
This paper is organized as follows:
- Providing brief background on encoding, training and evaluating text representations.
- Identifying and analyzing two classes of pretraining objectives.
- Exploring smaller and faster models and architectures, in both training and inference.
- Analyzing the impact of both the quality and the quantity of pretraining data.
- Discussing efforts on probing encoders and representations with respect to linguistic knowledge.
- Describing the efforts into training and evaluating multilingual representations.
Publicizing negative results in this area is especially important: training these models requires substantial compute and time, and reproducible evaluation must be ensured. Probing studies should also consider not only models and tasks but also the pretraining data.
Questions raised for users of contextualized encoders
- Is the compute requirement of these models worth the benefits?
- ELMo is a BiLSTM trained with a language-modeling objective: predict the next (or previous) token given the forward (or backward) history.
- The idea of looking at the full context was further refined as a cloze (fill-in-the-blank) task, realized as the denoising Masked Language Modeling (MLM) objective.
- MLM replaces some tokens with a [MASK] symbol and uses both left and right contexts (bidirectional context) to predict the masked tokens.
- The bidirectionality is key to outperforming a unidirectional language model on a large suite of natural language understanding benchmarks.
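The MLM corruption step above can be sketched in a few lines. This is a minimal, hypothetical illustration (function and parameter names are my own, not from the survey); the 80/10/10 split among [MASK], random token, and unchanged token follows the standard BERT recipe, and a real implementation would operate on subword IDs in batches.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", vocab=None, seed=0):
    """BERT-style MLM corruption (illustrative sketch): select roughly
    mask_prob of the positions as prediction targets; of those, 80%
    become [MASK], 10% a random vocabulary token, 10% stay unchanged."""
    rng = random.Random(seed)
    vocab = vocab or sorted(set(tokens))
    corrupted = []
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)      # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)             # 10%: keep unchanged
        else:
            corrupted.append(tok)
    return corrupted, targets
```

The encoder then sees the corrupted sequence (with both left and right context intact at every position) and is trained to recover the original tokens recorded in `targets`.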
WILL BE UPDATED SOON.