gianlp.models.text_representations.trainable_word_embedding_sequence.TrainableWordEmbeddingSequence
- class gianlp.models.text_representations.trainable_word_embedding_sequence.TrainableWordEmbeddingSequence(tokenizer: Callable[[str], List[str]], embedding_dimension: int, word2vec_src: Optional[Union[str, KeyedVectors]] = None, sequence_maxlen: int = 20, min_freq_percentile: float = 5, max_vocabulary: Optional[int] = None, pretrained_trainable: bool = False, random_state: int = 42)
Bases: TextRepresentation
Trainable word embedding sequence input
- Variables
_keras_model – Keras model built from processing the text input
_word2vec – the gensim KeyedVectors object that contains the word embedding matrix
_tokenizer – word tokenizer function
_pretrained_trainable – whether the pretrained vectors are trainable
_sequence_maxlen – the maximum allowed sequence length
_embedding_dimension – target embedding dimension
_min_freq_percentile – the minimum frequency percentile for a word to be considered part of the vocabulary
_max_vocabulary – optional maximum vocabulary size
_word_indexes – the word to index dictionary
_random_state – random seed
- Parameters
tokenizer – A tokenizer function that transforms each string into a list of string tokens. The produced tokens should match the keys in the pretrained word embeddings. The function must support serialization through pickle
word2vec_src – optional path to a word2vec-format .txt file or a gensim KeyedVectors object. If provided, the common corpus words that appear in the embedding are assigned these pretrained vectors
min_freq_percentile – the minimum frequency percentile for a word to be considered part of the vocabulary
max_vocabulary – optional maximum vocabulary size
embedding_dimension – the dimension of the target embedding
sequence_maxlen – The maximum allowed sequence length
pretrained_trainable – whether the pretrained vectors will also be trained. Ignored if word2vec_src is None
random_state – the random seed used for random processes
- Raises
ValueError – if pretrained embeddings are provided and their dimension does not match embedding_dimension
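A minimal construction sketch, assuming the class is importable from the module path above; the whitespace tokenizer and hyperparameter values are illustrative only:

```python
from gianlp.models.text_representations.trainable_word_embedding_sequence import (
    TrainableWordEmbeddingSequence,
)


def whitespace_tokenizer(text):
    # Defined at module level so it can be serialized through pickle,
    # as required for the tokenizer parameter (a lambda would not work)
    return text.lower().split()


# Fully trainable embeddings learned from the corpus (no pretrained vectors)
embedding = TrainableWordEmbeddingSequence(
    tokenizer=whitespace_tokenizer,
    embedding_dimension=50,
    sequence_maxlen=20,
)
```

If word2vec_src is given, the corpus words found in the pretrained embedding start from those vectors, and pretrained_trainable controls whether they keep training.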
Methods
build – Builds the whole chain of models in a recursive manner using the functional API.
deserialize – Deserializes a model.
parallel_tokenizer – Parallelizable wrapper for the tokenizer.
preprocess_texts – Given texts, returns the array representation needed for forwarding the keras model.
serialize – Serializes the model to be deserialized with the deserialize method.
tokenize_texts – Function for tokenizing texts.
Attributes
inputs – Method for getting all models that serve as input.
inputs_shape – Returns the shapes of the inputs of the model.
outputs_shape – Returns the output shape of the model.
trainable_weights_amount – Computes the total amount of trainable weights.
weights_amount – Computes the total amount of weights.
- build(texts: Union[List[str], Series, Dict[str, List[str]], DataFrame]) None
Builds the whole chain of models in a recursive manner using the functional API. Some operations may need the model to be built.
- Parameters
texts – the texts used for building, if needed; some models have to learn from a sample corpus before working
- Raises
ValueError – If the multi-text input keys do not match with the ones in a multi-text model
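For this class, building is expected to learn the vocabulary (and assign any pretrained vectors) from the sample corpus. A sketch reusing the illustrative embedding instance from the construction example above:

```python
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
# Must be called before preprocessing texts or counting weights
embedding.build(corpus)
```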
- classmethod deserialize(data: bytes) BaseModel
Deserializes a model
- Parameters
data – the data for deserializing
- Returns
a BaseModel object
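A round-trip sketch pairing deserialize with the serialize method documented below, assuming a built instance named embedding:

```python
data = embedding.serialize()  # bytes produced by serialize(), see below
restored = TrainableWordEmbeddingSequence.deserialize(data)
```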
- static get_bytes_from_model(model: Model, copy: bool = False) bytes
Transforms a keras model into bytes
- Parameters
model – the keras model
copy – whether to copy the model before saving. Copying the model is needed for complex nested models because the Keras save/load can otherwise fail
- Returns
a byte array
- static get_model_from_bytes(data: bytes) Model
Given bytes from a Keras model serialized with the get_bytes_from_model method, returns the model
- Parameters
data – the model bytes
- Returns
a keras model
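A byte round-trip sketch for a plain Keras model; the toy functional model is illustrative only:

```python
from tensorflow import keras

inp = keras.Input(shape=(8,))
out = keras.layers.Dense(4)(inp)
model = keras.Model(inp, out)

data = TrainableWordEmbeddingSequence.get_bytes_from_model(model)
restored = TrainableWordEmbeddingSequence.get_model_from_bytes(data)
```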
- property inputs: ModelInputsWrapper
Method for getting all models that serve as input. All TextRepresentation objects have no models as input.
- Returns
a list or list of tuples containing BaseModel objects
- property inputs_shape: Union[List[ModelIOShape], ModelIOShape]
Returns the shapes of the inputs of the model
- Returns
a list of shape tuples or a single shape tuple
- static parallel_tokenizer(text: str, tokenizer: Callable[[str], List[str]], sequence_maxlength: Optional[int] = None) List[str]
Parallelizable wrapper for the tokenizer
- Parameters
text – the text to tokenize
tokenizer – the tokenizer
sequence_maxlength – optional maximum sequence length
- Returns
a list of string tokens
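A one-call sketch, reusing the illustrative whitespace_tokenizer from the construction example:

```python
tokens = TrainableWordEmbeddingSequence.parallel_tokenizer(
    "The cat sat on the mat", whitespace_tokenizer, sequence_maxlength=20
)
# expected with this tokenizer: ["the", "cat", "sat", "on", "the", "mat"]
```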
- serialize() bytes
Serializes the model to be deserialized with the deserialize method
- Returns
a byte array
- static tokenize_texts(texts: Union[List[str], Series], tokenizer: Callable[[str], List[str]], sequence_maxlength: Optional[int] = None) List[List[str]]
Function for tokenizing texts
- Parameters
texts – the texts to tokenize
tokenizer – the tokenizer
sequence_maxlength – optional maximum sequence length
- Returns
a list of lists with string tokens
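The same idea over a batch of texts, again with the illustrative tokenizer:

```python
token_lists = TrainableWordEmbeddingSequence.tokenize_texts(
    ["The cat sat", "The dog ran"], whitespace_tokenizer, sequence_maxlength=20
)
# one token list per input text, each truncated to sequence_maxlength
```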
- property trainable_weights_amount: Optional[int]
Computes the total amount of trainable weights
- Returns
the total amount of trainable weights, or None if the model is not built
- property weights_amount: Optional[int]
Computes the total amount of weights
- Returns
the total amount of weights, or None if the model is not built
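A quick inspection sketch, assuming the built instance named embedding from the examples above:

```python
print(embedding.weights_amount)            # total weights, or None before build()
print(embedding.trainable_weights_amount)  # trainable subset, or None before build()
```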
- preprocess_texts(texts: Union[List[str], Series]) Union[List[ndarray], ndarray]
Given texts returns the array representation needed for forwarding the keras model
- Parameters
texts – the texts to preprocess
- Returns
a numpy array of shape (#texts, _sequence_maxlen)
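A preprocessing sketch on the instance built above; with sequence_maxlen=20 and two texts, the documented result shape is (2, 20):

```python
batch = embedding.preprocess_texts(["the cat sat", "the dog chased the cat"])
print(batch.shape)  # (2, 20): one fixed-length index sequence per text
```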
- property outputs_shape: ModelIOShape
Returns the output shape of the model
- Returns
a shape tuple
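A shape-inspection sketch on a built instance:

```python
print(embedding.inputs_shape)   # shape(s) the underlying Keras model expects
print(embedding.outputs_shape)  # shape the representation produces
```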