ELMo Source Code Reading Notes

Published 2020-03-01


A source-reading note on the ELMo implementation in allennlp


Forward Pass

1. Starting from the input

The input is a corpus that has already been tokenized, had stop words removed, and so on; it may contain some special symbols, e.g. sentence_lists = [ ['I', 'have', 'a', 'dog', ',', 'it', 'is', 'so', 'cute'], ['That', 'is', 'a', 'question'], ['an']].
The next step is to process it with the batch_to_ids() function to get the corresponding character ids, which is essentially a lookup operation.

from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper
from allennlp.modules.elmo import Elmo, batch_to_ids
from allennlp.nn.util import  add_sentence_boundary_token_ids

import numpy
import torch

# use batch_to_ids to convert sentences to character ids
sentence_lists = [
    ['I', 'have', 'a', 'dog', ',', 'it', 'is', 'so', 'cute'],
    ['That', 'is', 'a', 'question'],
    ['an']
]
character_ids = batch_to_ids(sentence_lists)
print(character_ids)
tensor([[[259,  74, 260,  ..., 261, 261, 261],
         [259, 105,  98,  ..., 261, 261, 261],
         [259,  98, 260,  ..., 261, 261, 261],
         ...,
         [259, 106, 116,  ..., 261, 261, 261],
         [259, 116, 112,  ..., 261, 261, 261],
         [259, 100, 118,  ..., 261, 261, 261]],

        [[259,  85, 105,  ..., 261, 261, 261],
         [259, 106, 116,  ..., 261, 261, 261],
         [259,  98, 260,  ..., 261, 261, 261],
         ...,
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0]],

        [[259,  98, 111,  ..., 261, 261, 261],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         ...,
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0]]])
# corresponds to 'I', 'have', 'a'
print("I have a")
print(character_ids[0][:3])
I have a
tensor([[259,  74, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261],
        [259, 105,  98, 119, 102, 260, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261],
        [259,  98, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261]])
print(character_ids[1][5])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0])

Two things to note about the output:

  • The output is a torch.tensor of shape (len(batch), max_sentence_length, max_word_length); in the allennlp source, max_word_length defaults to 50, and for sentences shorter than the longest one the remaining positions are all 0
  • Where these numbers come from:
    max_word_length = 50
    # char ids 0-255 come from utf-8 encoding bytes
    # assign 256-300 to special chars
    beginning_of_sentence_character = 256  # <begin sentence>
    end_of_sentence_character = 257  # <end sentence>
    beginning_of_word_character = 258  # <begin word>
    end_of_word_character = 259  # <end word>
    padding_character = 260 # <padding>

    The utf-8 code of 'I' should be 73, but allennlp adds 1 to every id when building the mapping (presumably so that 0 can be reserved for padding/masking), so each row has the form: 'begin of word', 'char ids', 'end of word', 'padding', 'padding', ... (see the quick check below)
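A quick check of that +1 offset, continuing from the character_ids computed above (just arithmetic on the constants listed here, not an allennlp call):

# the row for 'I' should be [begin_of_word + 1, ord('I') + 1, end_of_word + 1, padding + 1, ...]
expected_row = [258 + 1, ord('I') + 1, 259 + 1] + [260 + 1] * 47
print(expected_row[:5])         # [259, 74, 260, 261, 261]
print(character_ids[0][0][:5])  # tensor([259,  74, 260, 261, 261]) -- matches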


2. Into the Char-CNN

Before entering the network, each sentence first needs a BOS (beginning of sentence) and an EOS (end of sentence) token added; as a result, the input goes from
(batch_size, sequence_length, 50) to (batch_size, sequence_length + 2, 50). The mask tensor needs a similar change; its job is to indicate the length of each sentence.

bos_token = '<S>'
eos_token = '</S>'
mask = ((character_ids > 0).long().sum(dim=-1) > 0).long()
print(mask)
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0]])
character_ids_with_bos_eos, mask_with_bos_eos = add_sentence_boundary_token_ids(
                character_ids,
                mask,
                torch.from_numpy(numpy.array(ELMoCharacterMapper.beginning_of_sentence_characters) + 1),
                torch.from_numpy(numpy.array(ELMoCharacterMapper.end_of_sentence_characters) + 1 )
        )
print("BOS, an, EOS")
print(character_ids_with_bos_eos[2][:4])
print(mask_with_bos_eos)
BOS, an, EOS
tensor([[259, 257, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261],
        [259,  98, 111, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261],
        [259, 258, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
         261, 261, 261, 261, 261, 261, 261, 261],
        [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

Next, each char id is converted into its corresponding char embedding; the size goes from (batch_size, sequence_length + 2, 50) to (batch_size * (sequence_length + 2), max_chars_per_token, embed_dim). From the allennlp code, this char embedding can either be defined in an hdf5 file or be a learned embedding.

P.S. Judging from someone else's hdf5 file, there are 262 char embeddings in total (261 + 1: the 261 are read from the hdf5 file at initialization, and the extra 1 is added when the char embedding is built, with all elements set to 0); each embedding is a 16-dimensional vector, and this is also configurable.

max_chars_per_token = 50
# for convenience, the char_embedding is just randomly initialized here
char_embedding_weights = torch.rand(263, 4)

character_embedding = torch.nn.functional.embedding(
                character_ids_with_bos_eos.view(-1, max_chars_per_token),
                char_embedding_weights
        )
print("EOS的embedding: ")
print(character_embedding[0][:6])
print("have的emmapedding: ")
print(character_embedding[2][:8])
EOS的embedding: 
tensor([[0.7641, 0.0505, 0.6018, 0.2663],
        [0.3650, 0.8026, 0.5612, 0.6900],
        [0.1094, 0.7314, 0.7448, 0.6157],
        [0.4000, 0.0209, 0.3658, 0.5280],
        [0.4000, 0.0209, 0.3658, 0.5280],
        [0.4000, 0.0209, 0.3658, 0.5280]])
embedding of 'have': 
tensor([[0.7641, 0.0505, 0.6018, 0.2663],
        [0.8820, 0.0339, 0.5742, 0.1159],
        [0.0041, 0.7074, 0.4636, 0.5560],
        [0.3604, 0.4848, 0.0683, 0.6330],
        [0.6414, 0.4172, 0.2827, 0.0735],
        [0.1094, 0.7314, 0.7448, 0.6157],
        [0.4000, 0.0209, 0.3658, 0.5280],
        [0.4000, 0.0209, 0.3658, 0.5280]])

Before entering the Char-CNN, transpose first, giving (batch_size * (sequence_length + 2), embed_dim, max_chars_per_token):

character_embedding = torch.transpose(character_embedding, 1, 2)

The convolutional network consists of a set of convolution kernels with different window widths; the output of each kernel then goes through a max-pooling layer, and the size becomes (batch_size * (sequence_length + 2), n_filters).

The convolution kernels are also configured from a file; allennlp gives an example json config:

{'char_cnn': {
                'activation': 'relu',
                'embedding': {'dim': 4},
                'filters': [[1, 4], [2, 8], [3, 16], [4, 32], [5, 64]],
                'max_characters_per_token': 50,
                'n_characters': 262,
                'n_highway': 2
                }
            }

In the filters entry, [1, 4] means the window width and the number of kernels; with the example json config above, n_filters = 4 + 8 + 16 + 32 + 64 = 124.
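As a rough sketch of how that config could turn into conv layers (an illustration reusing the torch imported above, not the allennlp code itself):

filters = [[1, 4], [2, 8], [3, 16], [4, 32], [5, 64]]
embed_dim = 4
# one Conv1d per [width, num] entry, sliding over the char dimension
convolutions = [
    torch.nn.Conv1d(in_channels=embed_dim, out_channels=num, kernel_size=width)
    for width, num in filters
]
n_filters = sum(num for _, num in filters)
print(n_filters)  # 124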

Part of the source:

convs = []
for i in range(len(self._convolutions)):
    conv = getattr(self, 'char_conv_{}'.format(i))
    convolved = conv(character_embedding)
    # (batch_size * sequence_length, n_filters for this width)
    convolved, _ = torch.max(convolved, dim=-1)
    convolved = activation(convolved)
    convs.append(convolved)         

# (batch_size * sequence_length, n_filters)
token_embedding = torch.cat(convs, dim=-1)

Finally, it goes through the highway network and a projection:

# apply the highway layers  (batch_size * (sequence_length + 2), n_filters)
token_embedding = self._highways(token_embedding)

# final projection  (batch_size * (sequence_length + 2), embedding_dim), embedding_dim is typically 512
token_embedding = self._projection(token_embedding)
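The highway network itself is not shown in the excerpt; below is a minimal sketch of one highway layer plus the final projection (HighwayLayer and all the shapes here are my own illustrative assumptions, not allennlp's Highway module):

class HighwayLayer(torch.nn.Module):
    """One highway layer: y = g * relu(W_h x) + (1 - g) * x, with gate g = sigmoid(W_t x)."""
    def __init__(self, dim):
        super().__init__()
        self._transform = torch.nn.Linear(dim, dim)  # nonlinear transform path
        self._gate = torch.nn.Linear(dim, dim)       # gate deciding transform vs. carry

    def forward(self, x):
        gate = torch.sigmoid(self._gate(x))
        return gate * torch.relu(self._transform(x)) + (1 - gate) * x

n_filters = 124                                 # 4 + 8 + 16 + 32 + 64 from the example config
embedding_dim = 512                             # the usual final projection size
highways = torch.nn.Sequential(HighwayLayer(n_filters), HighwayLayer(n_filters))  # n_highway = 2
projection = torch.nn.Linear(n_filters, embedding_dim)

token_embedding = torch.rand(33, n_filters)     # (batch_size * (sequence_length + 2), n_filters) = (3 * 11, 124)
print(projection(highways(token_embedding)).shape)  # torch.Size([33, 512])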

3. Into the two-layer BiLSTM

Each layer consists of two LSTMs running in opposite directions, and the layer's output is the concatenation of the two. The number of cells per layer (typically 4096) is larger than the input dimension (typically 512), so before one BiLSTM layer's output is fed into the next BiLSTM, it has to pass through a projection back down to embedding_dim.
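A rough sketch of the "LSTM with projection" idea for one direction, using a plain torch.nn.LSTM followed by a Linear layer (allennlp actually does this with its own LstmCellWithProjection, which applies the projection inside the recurrence at every timestep, so this only approximates the shapes):

embedding_dim = 512   # input size and projected output size
cell_size = 4096      # number of LSTM cells per direction

forward_lstm = torch.nn.LSTM(input_size=embedding_dim, hidden_size=cell_size, batch_first=True)
forward_projection = torch.nn.Linear(cell_size, embedding_dim)

inputs = torch.rand(3, 11, embedding_dim)   # (batch_size, sequence_length + 2, embedding_dim)
outputs, _ = forward_lstm(inputs)           # (3, 11, 4096)
projected = forward_projection(outputs)     # (3, 11, 512), what the next layer sees
print(projected.shape)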

The two BiLSTM layers are connected by a residual (skip) connection; that is, the first layer's output is not only the input to the second layer but is also added to the second layer's output.

...
# Skip connections, just adding the input to the output.
if layer_index != 0:
    forward_output_sequence += forward_cache
    backward_output_sequence += backward_cache
...

Reading the source also surfaced a few points about how PyTorch handles RNN-based networks; I'm noting them down here.

  • sort_and_run_forward():
    def sort_and_run_forward(self,
                 module: Callable[[PackedSequence, Optional[RnnState]],
                                  Tuple[Union[PackedSequence, torch.Tensor], RnnState]],
                 inputs: torch.Tensor,
                 mask: torch.Tensor,
                 hidden_state: Optional[RnnState] = None):

    Since PyTorch requires the batch of an RNN-based network to be sorted first (by sentence length or some other key), this function takes care of the sorting, then calls pack_padded_sequence() to obtain a PackedSequence, and finally feeds the PackedSequence and the LSTM's hidden_state into the module. hidden_state is the network's state from the previous timestep; at initialization it can simply be set to h0 = None.

    # Actually call the module on the sorted PackedSequence.
    module_output, final_states = module(packed_sequence_input, initial_states)

    What does pack_padded_sequence() do? Refer to this article.
    pack_padded_sequence() and pad_packed_sequence() are inverses of each other; setting batch_first = True indicates that the first dimension is batch_size (a small demo follows below).
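A small demo of the sorting-then-packing flow, and of pad_packed_sequence() undoing pack_padded_sequence() (a sketch reusing the torch imported above):

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.tensor([[6., 0., 0.],
                       [1., 2., 3.],
                       [4., 5., 0.]])   # (batch_size, max_len), zero-padded
lengths = torch.tensor([1, 3, 2])

# with the default enforce_sorted=True, the batch must be sorted by length, descending
sorted_lengths, order = lengths.sort(descending=True)
packed = pack_padded_sequence(padded[order], sorted_lengths, batch_first=True)

# pad_packed_sequence restores the padded layout (in sorted order) plus the lengths
unpacked, unpacked_lengths = pad_packed_sequence(packed, batch_first=True)
print(unpacked)           # tensor([[1., 2., 3.], [4., 5., 0.], [6., 0., 0.]])
print(unpacked_lengths)   # tensor([3, 2, 1])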

