gianlp.models.text_representations.chars_per_word_sequence.CharPerWordEmbeddingSequence
- class gianlp.models.text_representations.chars_per_word_sequence.CharPerWordEmbeddingSequence(tokenizer: Callable[[str], List[str]], embedding_dimension: int = 256, word_maxlen: int = 30, char_maxlen: int = 12, min_freq_percentile: int = 5, random_state: int = 42)
Bases: TextRepresentation
Char-per-word sequence: a wrapper for instantiating a per-chunk sequencer that represents each word of a text as a sequence of character embeddings.
- Variables
_chunker – function used for chunking the texts
_sequencer – the text input used for sequencing each chunk
_chunking_maxlen – the maximum length in chunks for a text
- Parameters
tokenizer – a tokenizer function that transforms each string into a list of string tokens; the function must support serialization through pickle
embedding_dimension – The char embedding dimension
word_maxlen – the max length for word sequences
char_maxlen – the max length for chars within a word
min_freq_percentile – minimum frequency percentile for keeping a char. If a char's frequency is below this percentile, it will be treated as unknown.
random_state – random seed
- Returns
a PerChunkSequencer with your tokenizer and a CharEmbeddingSequence as sequencer
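A minimal construction sketch following the signature above. The import path is taken from the heading; whitespace_tokenizer is a hypothetical stand-in for any pickle-serializable tokenizer:

    from gianlp.models.text_representations.chars_per_word_sequence import (
        CharPerWordEmbeddingSequence,
    )

    def whitespace_tokenizer(text):
        # Defined at module level (not a lambda) so it can be pickled,
        # as the tokenizer parameter requires.
        return text.split()

    char_per_word = CharPerWordEmbeddingSequence(
        tokenizer=whitespace_tokenizer,
        embedding_dimension=64,  # default is 256
        word_maxlen=20,          # max words per text
        char_maxlen=10,          # max chars per word
    )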
Methods

build – Builds the whole chain of models in a recursive manner using the functional API.
deserialize – Deserializes a model.
parallel_tokenizer – Parallelizable wrapper for the tokenizer.
preprocess_texts – Given texts, returns the array representation needed for forwarding the keras model.
serialize – Serializes the model to be deserialized with the deserialize method.
tokenize_texts – Function for tokenizing texts.
Attributes

inputs – Method for getting all models that serve as input.
inputs_shape – Returns the shapes of the inputs of the model.
outputs_shape – Returns the output shape of the model.
trainable_weights_amount – Computes the total amount of trainable weights.
weights_amount – Computes the total amount of weights.
- build(texts: Union[List[str], Series, Dict[str, List[str]], DataFrame]) None
Builds the whole chain of models in a recursive manner using the functional API. Some operations may need the model to be built.
- Parameters
texts – the texts for building, if needed; some models have to learn from a sample corpus before working
- Raises
ValueError – If the multi-text input keys do not match with the ones in a multi-text model
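A sketch of building from a sample corpus, continuing the construction example above; the corpus contents are illustrative only:

    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "pack my box with five dozen liquor jugs",
    ]
    # Learns the char vocabulary from the sample corpus.
    char_per_word.build(corpus)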
- classmethod deserialize(data: bytes) BaseModel
Deserializes a model
- Parameters
data – the data for deserializing
- Returns
a BaseModel object
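A sketch of a serialize/deserialize round trip, pairing this classmethod with the serialize method documented further below; restoring the built instance from earlier is assumed:

    data = char_per_word.serialize()  # bytes, see serialize() below
    restored = CharPerWordEmbeddingSequence.deserialize(data)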
- static get_bytes_from_model(model: Model, copy: bool = False) bytes
Transforms a keras model into bytes
- Parameters
model – the keras model
copy – whether to copy the model before saving; copying is needed for complex nested models, where the keras save/load can fail
- Returns
a byte array
- static get_model_from_bytes(data: bytes) Model
Given bytes from a keras model serialized with the get_bytes_from_model method, returns the model
- Parameters
data – the model bytes
- Returns
a keras model
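A sketch of the byte round trip using a toy tf.keras model; the architecture here is illustrative only and not part of the library:

    from tensorflow import keras

    # A toy functional model: 8 inputs, 4 dense outputs.
    inputs = keras.Input(shape=(8,))
    outputs = keras.layers.Dense(4)(inputs)
    toy_model = keras.Model(inputs, outputs)

    model_bytes = CharPerWordEmbeddingSequence.get_bytes_from_model(toy_model)
    same_model = CharPerWordEmbeddingSequence.get_model_from_bytes(model_bytes)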
- property inputs: ModelInputsWrapper
Method for getting all models that serve as input. TextRepresentation objects have no models as input.
- Returns
a list or list of tuples containing BaseModel objects
- property inputs_shape: Union[List[ModelIOShape], ModelIOShape]
Returns the shapes of the inputs of the model
- Returns
a list of shape tuples or a single shape tuple
- abstract property outputs_shape: ModelIOShape
Returns the output shape of the model
- Returns
the output shape tuple
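A sketch inspecting both shape properties on a built instance; the exact ModelIOShape contents depend on the configured word_maxlen and char_maxlen:

    print(char_per_word.inputs_shape)   # shape(s) expected by the keras model
    print(char_per_word.outputs_shape)  # shape produced by the model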
- static parallel_tokenizer(text: str, tokenizer: Callable[[str], List[str]], sequence_maxlength: Optional[int] = None) List[str]
Parallelizable wrapper for the tokenizer
- Parameters
text – the text to tokenize
tokenizer – the tokenizer
sequence_maxlength – optional maximum sequence length.
- Returns
a list of string tokens
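A sketch of calling the wrapper directly with the hypothetical whitespace_tokenizer from above; that sequence_maxlength truncates the token list is an assumption based on the parameter name:

    tokens = CharPerWordEmbeddingSequence.parallel_tokenizer(
        "the quick brown fox",
        whitespace_tokenizer,
        sequence_maxlength=3,
    )
    # Presumably ["the", "quick", "brown"] if maxlength truncates.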
- abstract preprocess_texts(texts: Union[List[str], Series, Dict[str, List[str]], DataFrame]) Union[List[ndarray], ndarray]
Given texts returns the array representation needed for forwarding the keras model
- Parameters
texts – the texts to preprocess
- Returns
a numpy array or list of numpy arrays representing the texts
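A sketch of preprocessing on a built instance; the resulting array shapes are expected to match inputs_shape:

    arrays = char_per_word.preprocess_texts(["the quick brown fox"])
    # A numpy array (or list of arrays) ready to feed the keras model.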
- serialize() bytes
Serializes the model to be deserialized with the deserialize method
- Returns
a byte array
- static tokenize_texts(texts: Union[List[str], Series], tokenizer: Callable[[str], List[str]], sequence_maxlength: Optional[int] = None) List[List[str]]
Function for tokenizing texts
- Parameters
texts – the texts to tokenize
tokenizer – the tokenizer
sequence_maxlength – optional maximum sequence length.
- Returns
a list of lists with string tokens
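A sketch of batch tokenization with the hypothetical whitespace_tokenizer from above:

    token_lists = CharPerWordEmbeddingSequence.tokenize_texts(
        ["the quick brown fox", "hello world"],
        whitespace_tokenizer,
    )
    # [["the", "quick", "brown", "fox"], ["hello", "world"]]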
- property trainable_weights_amount: Optional[int]
Computes the total amount of trainable weights
- Returns
the total amount of trainable weights, or None if the model is not built
- property weights_amount: Optional[int]
Computes the total amount of weights
- Returns
the total amount of weights, or None if the model is not built
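A sketch reading both weight counters on the instance built earlier; per the documented return types, both yield None before build() has been called:

    print(char_per_word.weights_amount)            # int once built, None before
    print(char_per_word.trainable_weights_amount)  # int once built, None before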