gianlp.models.text_representations.trainable_word_embedding_sequence.TrainableWordEmbeddingSequence

class gianlp.models.text_representations.trainable_word_embedding_sequence.TrainableWordEmbeddingSequence(tokenizer: Callable[[str], List[str]], embedding_dimension: int, word2vec_src: Optional[Union[str, KeyedVectors]] = None, sequence_maxlen: int = 20, min_freq_percentile: float = 5, max_vocabulary: Optional[int] = None, pretrained_trainable: bool = False, random_state: int = 42)

Bases: TextRepresentation

Trainable word embedding sequence input

Variables
  • _keras_model – Keras model built from processing the text input

  • _word2vec – the gensim KeyedVectors object that contains the word embedding matrix

  • _tokenizer – word tokenizer function

  • _pretrained_trainable – whether the pretrained vectors are trainable

  • _sequence_maxlen – the max length of an allowed sequence

  • _embedding_dimension – target embedding dimension

  • _min_freq_percentile – the minimum frequency percentile for a word to be considered part of the vocabulary

  • _max_vocabulary – optional maximum vocabulary size

  • _word_indexes – the word to index dictionary

  • _random_state – random seed

Parameters
  • tokenizer – A tokenizer function that transforms each string into a list of string tokens. The produced tokens should match the keys in the pretrained word embeddings. The function must support serialization through pickle

  • word2vec_src – optional path to a word2vec-format .txt file or a gensim KeyedVectors object. If provided, the common corpus words found in the embedding are assigned these pretrained vectors

  • min_freq_percentile – the minimum frequency percentile for a word to be considered part of the vocabulary

  • max_vocabulary – optional maximum vocabulary size

  • embedding_dimension – the dimension of the target embedding

  • sequence_maxlen – The maximum allowed sequence length

  • pretrained_trainable – whether the pretrained vectors will also be trained. Ignored if word2vec_src is None

  • random_state – the random seed used for random processes

Raises

ValueError – if a pretrained embedding is provided and its dimension does not match embedding_dimension
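
A minimal construction sketch (the tokenizer and values are illustrative; a module-level function is used so it can be pickled):

>>> from gianlp.models.text_representations.trainable_word_embedding_sequence import TrainableWordEmbeddingSequence
>>> def whitespace_tokenizer(text):
...     # illustrative picklable tokenizer: lowercase and split on whitespace
...     return text.lower().split()
>>> emb = TrainableWordEmbeddingSequence(
...     tokenizer=whitespace_tokenizer,
...     embedding_dimension=50,
...     sequence_maxlen=20,
... )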

Methods

build

Builds the whole chain of models in a recursive manner using the functional API.

deserialize

Deserializes a model

parallel_tokenizer

Parallelizable wrapper for the tokenizer

preprocess_texts

Given texts returns the array representation needed for forwarding the keras model

serialize

Serializes the model to be deserialized with the deserialize method

tokenize_texts

Function for tokenizing texts

Attributes

inputs

Method for getting all models that serve as input.

inputs_shape

Returns the shapes of the inputs of the model

outputs_shape

Returns the output shape of the model

trainable_weights_amount

Computes the total amount of trainable weights

weights_amount

Computes the total amount of weights

build(texts: Union[List[str], Series, Dict[str, List[str]], DataFrame]) None

Builds the whole chain of models in a recursive manner using the functional API. Some operations may need the model to be built.

Parameters

texts – the texts for building, if needed; some models have to learn from a sample corpus before working

Raises

ValueError – If the multi-text input keys do not match with the ones in a multi-text model
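
Continuing the construction sketch above, building from a small illustrative corpus:

>>> corpus = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]
>>> emb.build(corpus)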

classmethod deserialize(data: bytes) BaseModel

Deserializes a model

Parameters

data – the data for deserializing

Returns

a BaseModel object

static get_bytes_from_model(model: Model, copy: bool = False) bytes

Transforms a keras model into bytes

Parameters
  • model – the keras model

  • copy – whether to copy the model before saving. Copying the model is needed for complex nested models, where the Keras save/load can otherwise fail

Returns

a byte array

static get_model_from_bytes(data: bytes) Model

Given bytes from a Keras model serialized with the get_bytes_from_model method, returns the model

Parameters

data – the model bytes

Returns

a keras model
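
A round-trip sketch for these two helpers, assuming a TensorFlow Keras backend and a toy model:

>>> from tensorflow.keras import Sequential
>>> from tensorflow.keras.layers import Dense
>>> toy = Sequential([Dense(4, input_shape=(8,))])  # illustrative model
>>> blob = TrainableWordEmbeddingSequence.get_bytes_from_model(toy)
>>> restored = TrainableWordEmbeddingSequence.get_model_from_bytes(blob)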

property inputs: ModelInputsWrapper

Method for getting all models that serve as input. TextRepresentation objects have no models as input.

Returns

a list or list of tuples containing BaseModel objects

property inputs_shape: Union[List[ModelIOShape], ModelIOShape]

Returns the shapes of the inputs of the model

Returns

a list of shape tuple or shape tuple

static parallel_tokenizer(text: str, tokenizer: Callable[[str], List[str]], sequence_maxlength: Optional[int] = None) List[str]

Parallelizable wrapper for the tokenizer

Parameters
  • text – the text to tokenize

  • tokenizer – the tokenizer

  • sequence_maxlength – optional maximum sequence length

Returns

a list of string tokens
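
For example, reusing the illustrative whitespace_tokenizer from the construction sketch:

>>> TrainableWordEmbeddingSequence.parallel_tokenizer("The cat sat", whitespace_tokenizer)
['the', 'cat', 'sat']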

serialize() bytes

Serializes the model to be deserialized with the deserialize method

Returns

a byte array
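
A serialization round-trip sketch, continuing the earlier example of a built representation:

>>> data = emb.serialize()
>>> restored = TrainableWordEmbeddingSequence.deserialize(data)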

static tokenize_texts(texts: Union[List[str], Series], tokenizer: Callable[[str], List[str]], sequence_maxlength: Optional[int] = None) List[List[str]]

Function for tokenizing texts

Parameters
  • texts – the texts to tokenize

  • tokenizer – the tokenizer

  • sequence_maxlength – optional maximum sequence length

Returns

a list of lists with string tokens
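
A sketch reusing the illustrative whitespace_tokenizer, assuming sequence_maxlength truncates each token list:

>>> TrainableWordEmbeddingSequence.tokenize_texts(
...     ["The cat sat on the mat", "Dogs chase cats"],
...     whitespace_tokenizer,
...     sequence_maxlength=3,
... )
[['the', 'cat', 'sat'], ['dogs', 'chase', 'cats']]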

property trainable_weights_amount: Optional[int]

Computes the total amount of trainable weights

Returns

the total amount of trainable weights, or None if the model is not built

property weights_amount: Optional[int]

Computes the total amount of weights

Returns

the total amount of weights, or None if the model is not built
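
Both properties return None until build is called; afterwards, a quick sanity check might look like this (continuing the earlier sketch):

>>> emb.weights_amount is None
False
>>> emb.trainable_weights_amount <= emb.weights_amount
True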

preprocess_texts(texts: Union[List[str], Series]) Union[List[ndarray], ndarray]

Given texts returns the array representation needed for forwarding the keras model

Parameters

texts – the texts to preprocess

Returns

a numpy array of shape (#texts, _sequence_maxlen)
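
Continuing the earlier sketch, the second dimension of the result follows sequence_maxlen (here 20):

>>> arr = emb.preprocess_texts(["the cat sat on the mat"])
>>> arr.shape
(1, 20)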

property outputs_shape: ModelIOShape

Returns the output shape of the model

Returns

a shape tuple