Quickstart: Binary Classifier Tutorial

We are going to build a binary classifier for the SMS Spam Collection dataset. We start by downloading it.

!curl -O https://raw.githubusercontent.com/justmarkham/DAT5/master/data/SMSSpamCollection.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  466k  100  466k    0     0   944k      0 --:--:-- --:--:-- --:--:--  946k
import pandas as pd
dataset = pd.read_csv('SMSSpamCollection.txt', sep='\t', header=None, names=['label', 'text'])
print(dataset.sample(5))
     label                                               text
4898   ham  I cant pick the phone right now. Pls send a me...
3023   ham                        How dare you change my ring
333   spam  Call Germany for only 1 pence per minute! Call...
3934   ham                             Playin space poker, u?
12    spam  URGENT! You have won a 1 week FREE membership ...
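
Before modeling it is worth checking the class balance. The exact counts depend on the copy of the file, but spam is the minority class, roughly 13% of messages:

dataset['label'].value_counts(normalize=True)  # proportion of each label; spam is the minority class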

Across these tutorials we are going to build different classifier architectures. We start with the simplest one: a char embedding sequence classifier.

Char embedding sequence classifier

from gianlp.models import CharEmbeddingSequence, RNNDigest, KerasWrapper
WARNING:nlp_builder:The NLP builder disables all tensorflow-related logging

We create a char embedding sequence for the texts, with an embedding dimension of 32 per char and a sequence maxlen that matches the 80th percentile of text lengths.

Text representations

help(CharEmbeddingSequence.__init__)
Help on function __init__ in module gianlp.models.text_representations.char_embedding_sequence:

__init__(self, embedding_dimension: int = 256, sequence_maxlen: int = 80, min_freq_percentile: int = 5, random_state: int = 42)
    :param embedding_dimension: The char embedding dimension
    :param sequence_maxlen: The maximum allowed sequence length
    :param min_freq_percentile: minimum percentile of the frequency for keeping a char.
                                If a char has a frequency lower than this percentile it
                                would be treated as unknown.
    :param random_state: random seed
char_emb = CharEmbeddingSequence(embedding_dimension=32, sequence_maxlen=int(dataset['text'].str.len().quantile(0.8)))

We can see that the output shape of the char embedding is a sequence of at most 137 chars, each represented by a 32-dimensional embedding

char_emb.outputs_shape
(137, 32), float32

We also have an input shape for interacting with Keras models

char_emb.inputs_shape
(137,), int32
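
Under the hood, each text is mapped to a fixed-length sequence of char indices, padded or truncated to the maxlen. A minimal sketch of that idea in plain NumPy/Keras (illustrative only, not gianlp's internal code; the toy vocabulary and the padding details are assumptions):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# toy vocabulary: index 0 is reserved for padding/unknown chars
vocab = {c: i + 1 for i, c in enumerate('abcdefghijklmnopqrstuvwxyz ')}
encoded = [[vocab.get(c, 0) for c in text.lower()] for text in ['free entry now', 'how dare you']]
pad_sequences(encoded, maxlen=137)  # int32 matrix of shape (2, 137), matching inputs_shape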

Sequence digest

The output sequence will be collapsed into a single state vector using RNNs

help(RNNDigest.__new__)
Help on function __new__ in module gianlp.models.rnn_digest:

__new__(cls, inputs: Union[List[ForwardRef('BaseModel')], List[Tuple[str, List[ForwardRef('BaseModel')]]], gianlp.models.base_model.BaseModel], units_per_layer: int, rnn_type: str, stacked_layers: int = 1, masking: bool = True, bidirectional: bool = False, random_seed: int = 42, **kwargs)
    :param inputs: the inputs of the model
    :param units_per_layer: the amount of units per layer
    :param rnn_type: the type of rnn, could be "rnn", "gru" or "lstm"
    :param stacked_layers: the amount of layers to stack, 1 by default
    :param masking: if apply masking with 0 to the sequence
    :param bidirectional: if it's bidirectional
    :param random_seed: the seed for random processes
    :param kwargs: extra arguments for the rnn layers
rnn_digest = RNNDigest(char_emb, units_per_layer=40, rnn_type='gru', stacked_layers=2)

The output shape is now a single vector

rnn_digest.outputs_shape
WARNING:nlp_builder:If the model and wrapper inputs mismatch it will only be noticed when building, before that output shape is an estimate and does not assert inputs.
(40,), float32
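
Conceptually, the digest stacks two GRU layers where only the last one returns its final state. A rough plain-Keras equivalent (a sketch under that assumption, not gianlp's exact graph):

from tensorflow.keras.layers import GRU, Input, Masking
from tensorflow.keras.models import Model

seq_in = Input(shape=(137, 32))              # the char embedding output
h = Masking(mask_value=0.0)(seq_in)          # masking=True skips padded positions
h = GRU(40, return_sequences=True)(h)        # first stacked layer keeps the sequence
h = GRU(40)(h)                               # last layer returns a single (40,) state
digest_sketch = Model(seq_in, h)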

Binary classifier

Now we simply build a binary classifier with an input of 40 floats, defining the Keras model as usual; it is compiled later through the wrapper.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()

model.add(Dense(20, input_shape=(40,), activation='tanh'))
model.add(Dense(20, activation='tanh'))
model.add(Dense(20, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
help(KerasWrapper.__init__)
Help on function __init__ in module gianlp.models.keras_wrapper:

__init__(self, inputs: Union[List[ForwardRef('BaseModel')], List[Tuple[str, List[ForwardRef('BaseModel')]]], gianlp.models.base_model.BaseModel], wrapped_model: keras.engine.training.Model, **kwargs)
    :param inputs: the models that are the input of this one. Either a list containing model inputs one by one or a
    dict indicating which text name is assigned to which inputs.
    If a list, all should have multi-text input or don't have it. If it's a dict all shouldn't have multi-text
    input.
    :param wrapped_model: the keras model to wrap.
                            if it has multiple inputs, inputs should be a list and have the same len
    :param random_seed: random seed used in training
    :raises:
        ValueError:
        - When the wrapped model is not a keras model
        - When the keras model to wrap does not have a defined input shape
        - When inputs is a list of models and some of the models in the input have multi-text input and others
        don't.
        - When inputs is a list of tuples and any of the models has multi-text input.
        - When inputs is a list of tuples with length one
        - When inputs is a list containing some tuples of (str, model) and some models
        - When the wrapped model has multiple inputs and the inputs don't have the same length as the inputs in
        wrapped model
model = KerasWrapper(rnn_digest, model)
model.outputs_shape
WARNING:nlp_builder:If the model and wrapper inputs mismatch it will only be noticed when building, before that output shape is an estimate and does not assert inputs.
WARNING:nlp_builder:If the model and wrapper inputs mismatch it will only be noticed when building, before that output shape is an estimate and does not assert inputs.

(1,), float32

We can see a model summary

print(model)
WARNING:nlp_builder:If the model and wrapper inputs mismatch it will only be noticed when building, before that output shape is an estimate and does not assert inputs.
WARNING:nlp_builder:If the model and wrapper inputs mismatch it will only be noticed when building, before that output shape is an estimate and does not assert inputs.
WARNING:nlp_builder:If the model and wrapper inputs mismatch it will only be noticed when building, before that output shape is an estimate and does not assert inputs.
WARNING:nlp_builder:If the model and wrapper inputs mismatch it will only be noticed when building, before that output shape is an estimate and does not assert inputs.
        Model        |      Inputs shape     |      Output shape     |Trainable|  Total  |    Connected to
                     |                       |                       | weights | weights |
==============================================================================================================
7fd2a8539c70 CharEmbe|     (137,), int32     |   (137, 32), float32  |    ?    |    ?    |
7fd2300e7520 KerasWra|   (137, 32), float32  |     (40,), float32    |    ?    |    ?    |7fd2a8539c70 CharEmb
7fd1e0413520 KerasWra|     (40,), float32    |     (1,), float32     |    ?    |    ?    |7fd2300e7520 KerasWr
==============================================================================================================
                     |                       |                       |    ?    |    ?    |

Note all the warnings and the ? symbols in the weight columns. This is because the model has not been built yet: the shapes shown are estimates that only hold if the models are connected properly, and the weight counts cannot be known until the representations are built.

Train-test split

dataset = dataset.sample(len(dataset))
train = dataset.iloc[:int(len(dataset)*0.8)]
test = dataset.iloc[int(len(dataset)*0.8):]
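
Since spam is a minority class, a stratified split that keeps the label proportions equal across train and test is often preferable. An optional alternative using scikit-learn, if it is available:

from sklearn.model_selection import train_test_split

train, test = train_test_split(dataset, test_size=0.2, stratify=dataset['label'], random_state=42)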

Our models always need to be built with a corpus of texts so the text representations can learn how to preprocess. In the case of the char embedding, it needs to learn the most common chars in order to know how many embedding vectors it will train.

model.build(train['text'])

We can see the complete summary now

print(model)
        Model        |      Inputs shape     |      Output shape     |Trainable|  Total  |    Connected to
                     |                       |                       | weights | weights |
==============================================================================================================
7fd2a8539c70 CharEmbe|     (137,), int32     |   (137, 32), float32  |   3392  |   3392  |
7fd2300e7520 KerasWra|   (137, 32), float32  |     (40,), float32    |  22112  |  22112  |7fd2a8539c70 CharEmb
7fd1e0413520 KerasWra|     (40,), float32    |     (1,), float32     |  23793  |  23793  |7fd2300e7520 KerasWr
==============================================================================================================
                     |                       |                       |  23793  |  23793  |

Training

type(model)
gianlp.models.keras_wrapper.KerasWrapper
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(patience=2, mode='max', monitor='val_accuracy', restore_best_weights=True)
help(model.fit)
Help on method fit in module gianlp.models.trainable_model:

fit(x: Union[Generator[Tuple[Union[List[str], pandas.core.series.Series, Dict[str, List[str]], pandas.core.frame.DataFrame], Union[List[<built-in function array>], <built-in function array>]], NoneType, NoneType], List[str], pandas.core.series.Series, Dict[str, List[str]], pandas.core.frame.DataFrame] = None, y: Optional[<built-in function array>] = None, batch_size: int = 32, epochs: int = 1, verbose: Union[str, int] = 'auto', callbacks: List[keras.callbacks.Callback] = None, validation_split: Optional[float] = 0.0, validation_data: Union[Generator[Tuple[Union[List[str], pandas.core.series.Series, Dict[str, List[str]], pandas.core.frame.DataFrame], Union[List[<built-in function array>], <built-in function array>]], NoneType, NoneType], Tuple[Union[List[str], pandas.core.series.Series, Dict[str, List[str]], pandas.core.frame.DataFrame], Union[List[<built-in function array>], <built-in function array>]], NoneType] = None, steps_per_epoch: Optional[int] = None, validation_steps: Optional[int] = None) -> keras.callbacks.History method of gianlp.models.keras_wrapper.KerasWrapper instance
    Fits the model

    :param x: Input data. Could be:
        1. A generator that yields (x, y) where x is any valid format for x and y is the target numpy array
        2. A list of texts
        3. A pandas Series
        4. A pandas Dataframe
        5. A dict of lists containing texts
    :param y: Target, ignored if x is a generator. Numpy array.
    :param batch_size: Batch size for training, ignored if x is a generator
    :param epochs: Amount of epochs to train
    :param verbose: verbose mode for Keras training
    :param callbacks: list of Callback objects for Keras model
    :param validation_split: the proportion of data to use for validation, ignored if x is a generator
    :param validation_data: Validation data. Could be:
        1. A tuple containing (x, y) where x is a list of text and y is the target numpy array
        2. A generator that yields (x, y) where x is a list of texts and y is the target numpy array
    :param steps_per_epoch: Amount of generator steps to consider an epoch as finished. Ignored if x is not a
    generator
    :param validation_steps: Amount of generator steps to consider to feed each validation evaluation.
                            Ignored if validation_data is not a generator
    :return: A History object. Its History.history attribute is a record of training loss values and metrics values
    at successive epochs, as well as validation loss values and validation metrics values (if applicable).
hst = model.fit(train['text'], train['label'].map(lambda x: 1 if x=='spam' else 0).values,
                batch_size=256, epochs=30, validation_split=0.1,
                callbacks=[early_stopping])
Epoch 1/30
15/15 [==============================] - 10s 162ms/step - loss: 0.4862 - accuracy: 0.8016 - val_loss: 0.3075 - val_accuracy: 0.8828
Epoch 2/30
15/15 [==============================] - 1s 35ms/step - loss: 0.3187 - accuracy: 0.8805 - val_loss: 0.2899 - val_accuracy: 0.8906
Epoch 3/30
15/15 [==============================] - 0s 32ms/step - loss: 0.2495 - accuracy: 0.9112 - val_loss: 0.2561 - val_accuracy: 0.9102
Epoch 4/30
15/15 [==============================] - 0s 32ms/step - loss: 0.2233 - accuracy: 0.9245 - val_loss: 0.1870 - val_accuracy: 0.9336
Epoch 5/30
15/15 [==============================] - 0s 31ms/step - loss: 0.1731 - accuracy: 0.9430 - val_loss: 0.1805 - val_accuracy: 0.9414
Epoch 6/30
15/15 [==============================] - 0s 32ms/step - loss: 0.1335 - accuracy: 0.9589 - val_loss: 0.1678 - val_accuracy: 0.9609
Epoch 7/30
15/15 [==============================] - 0s 31ms/step - loss: 0.1227 - accuracy: 0.9651 - val_loss: 0.1681 - val_accuracy: 0.9453
Epoch 8/30
15/15 [==============================] - 0s 33ms/step - loss: 0.0929 - accuracy: 0.9750 - val_loss: 0.1784 - val_accuracy: 0.9531
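
The returned History object records the per-epoch losses and metrics, so plotting them is a quick sanity check (assuming matplotlib is installed):

import matplotlib.pyplot as plt

plt.plot(hst.history['accuracy'], label='train')
plt.plot(hst.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()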

Testing

help(model.predict)
Help on method predict in module gianlp.models.trainable_model:

predict(x: Union[Generator[Union[List[str], pandas.core.series.Series, Dict[str, List[str]], pandas.core.frame.DataFrame], NoneType, NoneType], List[str], pandas.core.series.Series, Dict[str, List[str]], pandas.core.frame.DataFrame], steps: Optional[int] = None, **kwargs: Any) -> Union[List[<built-in function array>], <built-in function array>] method of gianlp.models.keras_wrapper.KerasWrapper instance
    Predicts using the model

    :param x: could be:
        1. A list of texts
        2. A pandas Series
        3. A pandas Dataframe
        4. A dict of lists containing texts
        5. A generator of any of the above formats
    :param steps: steps for the generator, ignored if x is a list
    :param kwargs: arguments for keras predict method
    :return: the output of the keras model
test_preds = model.predict(test['text'])
test_preds
array([[0.01580496],
       [0.01534021],
       [0.9671472 ],
       ...,
       [0.01837054],
       [0.01742908],
       [0.01805047]], dtype=float32)
import numpy as np

y_true = (test['label'] == 'spam').values
y_pred = test_preds.flatten() > 0.5
print("Accuracy: ", np.equal(y_true, y_pred).sum() / len(test))
Accuracy:  0.9614090195198564
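
Accuracy can be flattering on an imbalanced dataset, so per-class precision and recall are more informative. An optional check with scikit-learn, reusing y_true and y_pred from above:

from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=['ham', 'spam']))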

Serialization

The model can be serialized to bytes for loading later

data = model.serialize()
type(data), len(data)
(bytes, 661113)
from gianlp.models import BaseModel
model2 = BaseModel.deserialize(data)
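
Since serialize returns plain bytes, persisting the model is an ordinary file write and read (the file name here is arbitrary):

with open('spam_classifier.gianlp', 'wb') as f:
    f.write(data)

with open('spam_classifier.gianlp', 'rb') as f:
    model2 = BaseModel.deserialize(f.read())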

When serialized, the model drops some of our own library's layers; this is optional and is meant to simplify the graph and make deserialization faster.

print(model2)
        Model        |      Inputs shape     |      Output shape     |Trainable|  Total  |    Connected to
                     |                       |                       | weights | weights |
==============================================================================================================
7fd180ab8b20 CharEmbe|     (137,), int32     |   (137, 32), float32  |   3392  |   3392  |
7fd15bd90190 KerasWra|   (137, 32), float32  |     (1,), float32     |  23793  |  23793  |7fd180ab8b20 CharEmb
==============================================================================================================
                     |                       |                       |  23793  |  23793  |
test_preds2 = model2.predict(test['text'])
assert test_preds.flatten().tolist() == test_preds2.flatten().tolist()