Toxic Comments Classification project logo. Originally generated with Leonardo.Ai.
Summary
In response to the challenges posed by online abuse and harassment hindering open discourse, a data science project was undertaken to develop an effective moderation system. The project utilized a DistilBERT-based model for multi-label text classification, leveraging transfer learning. Built using the PyTorch and Lightning frameworks, the model was trained on a dataset comprising 223,549 comments, encompassing both clean (non-toxic) and various forms of toxic content (toxic, severe toxic, obscene, threat, insult, identity hate). Standard procedures (a stratified training/validation split, a weighted loss to address class imbalance, early stopping, and per-epoch monitoring of loss, balanced accuracy, and F1 score) were employed to ensure robustness and accuracy. The model achieved a balanced accuracy of 0.964 and an F1 score of 0.551 on the test set, demonstrating its efficacy in identifying and moderating potentially harmful comments. This project aims to contribute to the creation of safer and more inclusive online communities by facilitating constructive conversations while mitigating the impact of abusive behavior.
1 Setup
The computations in this project were performed in two environments:
a local Windows 10 machine with an NVIDIA GTX 950M GPU:
for regular computations and analysis;
allows using local tools (e.g., VS Code with its plugins, Quarto, etc.);
free of charge.
a Colab Pro environment with either an NVIDIA V100 or an NVIDIA A100 GPU (depending on which one was available):
for resource-demanding computations like training deep learning (DL) models and inference;
allows using a powerful GPU;
paid service.
The necessary data, log, and other files were synchronized via Google Drive. The Python-related specifications of each environment can be found in the collapsible sections below; the main differences were the versions of Python and pandas.
Versions of Python and main libraries: local (Windows 10)
Code
%load_ext watermark
%watermark --conda
%watermark --python

# Main deep learning packages
%watermark -p torch,torchmetrics,torchinfo,transformers,lightning,tensorboard

# Other main packages
%watermark -p numpy,pandas,matplotlib,seaborn,sklearn,logging
This section describes the setup that was used in the Colab environment. Some packages were also installed or updated following the requirements file, but not all of them, as some updates would break compatibility with other packages required by Colab.
Code
import os
from google.colab import drive, files
Code
drive.mount("/content/drive")
Code
%cd /content/drive/MyDrive/Colab-proj/TC-M4-proj2
/content/drive/MyDrive/Colab-proj/TC-M4-proj2
Code
!pwd
/content/drive/MyDrive/Colab-proj/TC-M4-proj2
1.2 Main Setup
Code: The main Python setup
# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

# Plotting
%matplotlib inline

# Packages and modules -------------------------------

# Utilities
import os
import warnings
import numpy as np
import logging
import datetime
import json
from contextlib import contextmanager
from pathlib import Path

# Data frames
import pandas as pd

# EDA and plotting
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# ML: preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    multilabel_confusion_matrix,
    classification_report,
)

# Deep learning
import torch
from torch.utils.data import DataLoader, Dataset
from torchmetrics.classification import Accuracy, F1Score, ConfusionMatrix
from torchinfo import summary
import lightning as L
from lightning.pytorch.loggers import CSVLogger, TensorBoardLogger
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks import LearningRateMonitor, ModelCheckpoint
from lightning.pytorch import seed_everything
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Settings --------------------------------------------

# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(10, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_colwidth", 50)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# Turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)

# ----------------------------------------------------------------------------
Function timestamp()
def timestamp():
    """Print the current date and time."""
    print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
@contextmanager
def suppress_certain_logs_and_warnings(level=logging.WARNING):
    """Suppress certain logging messages and warnings.

    Suppress logging messages from Lightning and PyTorch related to GPU and TPU,
    as well as the warning related to not using parallel data loading.

    Based on
    https://github.com/Lightning-AI/pytorch-lightning/issues/3431#issuecomment-1527945684

    ```
    logging.getLogger("lightning.pytorch.utilities.rank_zero").setLevel(logging.WARNING)

    disables the following output:

    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs

    logging.getLogger("lightning.pytorch.accelerators.cuda").setLevel(logging.WARNING)

    disables the following output:

    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    ```

    Args:
        level (int): Logging level. Default is `logging.WARNING`.
    """
    log_rank_zero = "lightning.pytorch.utilities.rank_zero"
    log_cuda = "lightning.pytorch.accelerators.cuda"
    try:
        # Save the original log levels
        original_rank_zero_level = logging.getLogger(log_rank_zero).getEffectiveLevel()
        original_cuda_level = logging.getLogger(log_cuda).getEffectiveLevel()

        # Set the desired log levels
        logging.getLogger(log_rank_zero).setLevel(level)
        logging.getLogger(log_cuda).setLevel(level)

        # Suppress warnings
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", ".*does not have many workers.*")
            yield
    finally:
        # Restore the original log levels
        logging.getLogger(log_rank_zero).setLevel(original_rank_zero_level)
        logging.getLogger(log_cuda).setLevel(original_cuda_level)
Function style_multilabel_confusion_matrix()
def style_multilabel_confusion_matrix(
    ax_, norm=False, precision=None, cmap="Spectral", label_font_size=None
):
    """Style multi-label confusion matrix plots created with
    `torchmetrics.ConfusionMatrix()`.

    Args:
        ax_ (list): List of axes objects.
        norm (bool, optional): Whether to normalize the confusion matrix.
            Default is False.
        precision (int, optional): Number of decimal places for the labels.
        cmap (str, optional): Colormap. Default is "Spectral".
        label_font_size (int, optional): Font size for the labels. Default is None.
    """
    for ax in ax_:
        ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
        ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
        ax.set_title(ax.get_title().replace("Label", ""))
        # Apply the requested colormap
        for im in ax.get_images():
            im.set_cmap(cmap)
            if norm:
                im.set_clim(0, 1)
        # Override text formatting for labels
        for text in ax.texts:
            if precision is not None:
                text.set_text(f"{float(text.get_text()):.{precision}f}")
            if label_font_size is not None:
                text.set_fontsize(label_font_size)
Function value_counts_per_column()
def value_counts_per_column(data, columns):
    """Create value counts for each label (column).

    Function for EDA of multi-label classification problems.

    Args:
        data (pd.DataFrame): Data.
        columns (list): Columns to create value counts for.

    Returns:
        pd.DataFrame: Value counts for each label.
    """
    value_counts_dict = {col: data[col].value_counts() for col in columns}
    df = (
        pd.DataFrame(value_counts_dict)
        .T.reset_index()
        .rename(columns={"index": "↓ label / value →"})
    )
    return df
Function create_non_toxic_label()
toxic_labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
all_labels = ["non_toxic"] + toxic_labels


def create_non_toxic_label(data, toxic_labels):
    """Create label `non_toxic` where none of the toxic classes is present.

    Args:
        data (pd.DataFrame): Data.
        toxic_labels (list): List of toxic labels.

    Returns:
        pd.Series: Label `non_toxic`.
    """
    return data.apply(
        lambda x: 1 if all(x[col] == 0 for col in toxic_labels) else 0, axis=1
    )
Function plot_label_counts()
def plot_label_counts(data, labels, figsize=(10, 5), title="Counts of Each Label"):
    """Plot the counts of each label.

    Args:
        data (pd.DataFrame): Data.
        labels (list): Labels.
        figsize (tuple, optional): Figure size. Default is (10, 5).
        title (str, optional): Title of the plot.
    """
    n = data.shape[0]
    label_counts = data[labels].sum()
    label_counts_sorted = label_counts.sort_values(ascending=True)
    ax = label_counts_sorted.plot(
        kind="barh", ec="black", color=["#ff9999"] * 6 + ["lightblue"]
    )
    for i, v in enumerate(label_counts_sorted):
        ax.text(
            v + 3,
            i,
            f" {v} ({v/n * 100:,.1f}%)",
            color="black",
            va="center",
            fontweight="bold",
        )
    toxic_patch = mpatches.Patch(color="#ff9999", ec="black", label="Various Toxic")
    non_toxic_patch = mpatches.Patch(color="lightblue", ec="black", label="Non-Toxic")
    plt.legend(handles=[non_toxic_patch, toxic_patch], loc="lower right")
    ax.set_xlim(right=ax.get_xlim()[1] * 1.15)
    ax.set_xlabel("Count")
    ax.set_ylabel("Comment Type")
    ax.set_title(title)
    return {"ax": ax, "n": n, "label_counts": label_counts}
Function tokenize_and_encode()
def tokenize_and_encode(text, tokenizer, max_tokens=512):
    """Tokenize and encode a text.

    Args:
        text (str): Text to tokenize and encode.
        tokenizer (transformers.AutoTokenizer): Tokenizer.
        max_tokens (int): Maximum number of tokens per sequence. Default is 512.

    Returns:
        dict: Tokenized and encoded text.
    """
    return tokenizer.encode_plus(
        text=text,
        add_special_tokens=True,
        max_length=max_tokens,
        padding="max_length",
        pad_to_max_length=True,
        return_attention_mask=True,
        return_token_type_ids=False,
        truncation=True,
        return_tensors="pt",
    )
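For illustration, a minimal usage sketch (it assumes the distilbert-base-uncased tokenizer that is loaded in Section 3.1; the example comment is made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenize_and_encode("This is a perfectly friendly comment.", tokenizer)

# Both tensors have shape (1, 512): one sequence padded/truncated to 512 tokens
print(encoded["input_ids"].shape)
print(encoded["attention_mask"].shape)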
Function create_trainer() (wrapper for class Trainer)
def create_trainer(
    log_model_name: str = "model",
    max_epochs: int = 50,
    log_dir: str = "logs/",
    profiler: str | None = "pytorch",
    log_every_n_steps: int = 10,
    save_top_k_models: int = 5,
    monitor_metric: str = "val_loss",
    monitor_mode: str = "min",
    accelerator: str = "gpu",
    devices: list | int = [0],
    precision: int | str = "16-mixed",
    patience: int = 2,
    lr_logging_interval: str | None = False,
    **kwargs,
) -> L.Trainer:
    """Create a Trainer object for training a model.

    A wrapper with default settings for the `lightning.pytorch.Trainer` class.

    Args:
        log_model_name (str): Name of the model for logging purposes.
        log_dir (str): Directory to save logs and checkpoints.
        max_epochs (int): Maximum number of epochs.
        profiler (str): Profiler to use. Default is "pytorch".
        log_every_n_steps (int): Log every n-th step. Default is 10.
        save_top_k_models (int): Save top k models. Default is 5.
        monitor_metric (str): Metric to monitor. Default is "val_loss".
        monitor_mode (str): Mode of the monitored metric. Default is "min".
        accelerator (str): Accelerator to use. Default is "gpu".
        devices (list): List of devices to use. Default is [0].
        precision (int | str): Precision to use. Default is "16-mixed".
        patience (int): Patience for early stopping. Default is 2.
        lr_logging_interval (str | None | False): Logging interval for the
            learning rate. One of "step", "epoch", None or False.
            Default is False (do not log the learning rate).
        **kwargs: Additional arguments for the Trainer.

    Returns:
        lightning.Trainer: Trainer object configured with the specified settings.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)  # Ensure the log directory exists

    # Construct the Trainer
    with suppress_certain_logs_and_warnings(logging.WARNING):
        callbacks = [
            EarlyStopping(
                monitor=monitor_metric,
                mode=monitor_mode,
                patience=patience,
                check_finite=True,
            ),
            ModelCheckpoint(
                monitor=monitor_metric,
                mode=monitor_mode,
                filename=log_model_name
                + "--{epoch:03d}--{step:05d}--{val_loss:.2f}--{val_accuracy:.3f}",
                save_top_k=save_top_k_models,
            ),
        ]

        if (lr_logging_interval is None) or (lr_logging_interval is not False):
            callbacks.append(LearningRateMonitor(logging_interval=lr_logging_interval))

        trainer = L.Trainer(
            profiler=profiler,
            max_epochs=max_epochs,
            accelerator=accelerator,
            devices=devices,
            precision=precision,
            default_root_dir=log_dir / "checkpoints/",
            logger=[
                TensorBoardLogger(log_dir / "tensorboard_logs/", name=log_model_name),
                CSVLogger(log_dir / "csv_logs/", name=log_model_name),
            ],
            log_every_n_steps=log_every_n_steps,
            callbacks=callbacks,
            **kwargs,
        )
    return trainer
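As a usage sketch, a trainer for this project could be created roughly like this (the keyword values are illustrative; bert and data_module refer to the Lightning module and data module used later in the report):

# Create a trainer that logs under logs/ and keeps the 5 best checkpoints
trainer = create_trainer(
    log_model_name="DistilBERT",
    max_epochs=50,
    log_dir="logs/",
    monitor_metric="val_loss",
    patience=2,
)
# trainer.fit(model=bert, datamodule=data_module)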
Function read_metrics_log()
def read_metrics_log(
    log_path: str, model_name: str = "", out_format: str = "long"
) -> pd.DataFrame:
    """Read the metrics log file and return a DataFrame.

    The function reads a CSV file and extracts the relevant information about
    the tracked training and validation metrics.

    In long format (default) the returned DataFrame has the following columns:

    - epoch (int): Epoch number.
    - set (str): Training or validation set.
    - accuracy (float): Balanced accuracy (macro average).
    - f1 (float): F1 score (macro average).
    - loss (float): Loss.
    - model (str): Name of the model.

    In wide format the returned DataFrame has the following columns:

    - epoch
    - train_accuracy
    - train_f1
    - train_loss
    - val_accuracy
    - val_f1
    - val_loss
    - accuracy_diff (val_accuracy - train_accuracy)
    - f1_diff (val_f1 - train_f1)
    - loss_diff (val_loss - train_loss)
    - model

    Args:
        log_path (str): Path to the log file.
        model_name (str): Name of the model (value for column "model").
            Default is "".
        out_format (str): Output format ("wide" or "long"). Default is "long".

    Returns:
        pd.DataFrame: DataFrame containing the metrics information.
    """
    df = pd.read_csv(log_path)

    selected_columns = [
        "epoch",
        "train_accuracy",
        "train_f1",
        "train_loss_epoch",
        "val_accuracy",
        "val_f1",
        "val_loss_epoch",
    ]

    df = (
        df[selected_columns]
        .dropna(subset=selected_columns[1:], how="all")
        .astype({"epoch": int})
    )
    df.columns = df.columns.str.replace("_epoch$", "", regex=True)

    if out_format == "wide":
        # Validation and training metrics are logged on separate rows.
        # It is assumed that validation metrics are logged first.
        df = df.sort_values(by=["epoch", "val_loss"], na_position="last")

        first_row_condition = pd.isna(df.train_loss.iloc[0]) and pd.notna(
            df.val_loss.iloc[0]
        )
        second_row_condition = pd.notna(df.train_loss.iloc[1]) and pd.isna(
            df.val_loss.iloc[1]
        )

        if all([first_row_condition, second_row_condition]):
            subset = ["val_accuracy", "val_f1", "val_loss"]
            df[subset] = df[subset].ffill()
            output = (
                df.dropna(subset=["train_loss"])
                .reset_index(drop=True)
                .assign(
                    accuracy_diff=lambda df: df["val_accuracy"] - df["train_accuracy"],
                    f1_diff=lambda df: df["val_f1"] - df["train_f1"],
                    loss_diff=lambda df: df["val_loss"] - df["train_loss"],
                )
            )
        else:
            raise ValueError(
                "The log file is not in the expected format: "
                "there should be 2 rows (for validation and training results) "
                "and every second row in the same epoch must contain NaN values "
                "for the same metric. "
                "Fix the function, the file or use format='long'."
            )
    else:
        # Return long format (default)
        df_melted = pd.melt(
            df, id_vars=["epoch"], var_name="metric", value_name="value"
        ).dropna(subset=["value"])

        df_melted[["set", "metric_type"]] = df_melted["metric"].str.split(
            "_", n=1, expand=True
        )

        df_pivoted = (
            df_melted.pivot(
                index=["epoch", "set"], columns="metric_type", values="value"
            )
            .reset_index()
            .rename(columns={"loss_epoch": "loss"})
            .astype({"epoch": int})
        )
        df_pivoted.columns.name = None
        output = df_pivoted

    return output.assign(model=model_name)
2 Exploration and Pre-Processing
Performed locally
The code in the following section was run locally on a Windows 10 machine.
2.1 Dataset
The dataset used in this project is the Toxic Comment Classification Challenge dataset from Kaggle. It consists of a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
toxic
severe_toxic
obscene
threat
insult
identity_hate
During the analysis, an additional label, non_toxic, will be added.
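For example, the extra label can be derived with the helper defined in the Setup section (assuming the raw training data has already been read into data_train_val):

# 1 if none of the six toxic labels is present, 0 otherwise
data_train_val["non_toxic"] = create_non_toxic_label(data_train_val, toxic_labels)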
The data comes in a ZIP archive that contains several CSV files, which were saved in the data/raw directory; test.csv, test_labels.csv, and train.csv are used in this project.
The first file, train.csv, contained the data used for training and validation in this project: approximately 160,000 comments. Fig. 2.1 indicates class imbalance. Multi-row comments have leading and trailing quotes (this is addressed later). No other issues were found in the data, and a label indicating non-toxic comments was added to the dataset.
Correlation and hierarchical clustering analysis results presented in Fig. 2.2 suggest that the toxic, obscene, and insult labels have the strongest relationships. This should be investigated in more detail, but, unfortunately, that is out of scope for this project.
# Counts of each label
value_counts_per_column(data_train_val, all_labels).style.format(precision=0)
↓ label / value →         0         1
non_toxic             16225    143346
toxic                144277     15294
severe_toxic         157976      1595
obscene              151122      8449
threat               159093       478
insult               151694      7877
identity_hate        158166      1405
Code
plot_label_counts(
    data_train_val,
    all_labels,
    title="Counts of Each Label (Training + Validation Data)",
)
plt.show()
Fig. 2.1. Distribution of comment types (labels) in the training and validation data. Each comment can have multiple labels.
Code
g = sns.clustermap(
    data_train_val[all_labels].corr(),
    method="ward",
    cmap="RdBu",
    annot=True,
    annot_kws={"size": 8},
    vmin=-1,
    vmax=1,
    figsize=(7, 5),
    cbar_pos=(0.94, 0.91, 0.03, 0.1),
    cbar_kws={"location": "right"},
    dendrogram_ratio=(0.2, 0),
)
g.figure.suptitle(
    "Correlation Between Presence of Comment Types",
    fontsize=12,
    y=1.03,
    x=0.55,
)
plt.show()
Fig. 2.2. Correlation between presence of comment types.
2.3 Test Data
The test data comes in two separate files: test.csv contains the comments and test_labels.csv contains their labels. There were approximately 153,000 comments in the test set, but some labels are unknown and marked with -1, so only around 64,000 comments were included in the testing part. The test data has the same issue with multi-row comments as the training data, and the distribution of labels is similar to the training set (see Fig. 2.3).
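A minimal sketch of how such rows could be excluded (assuming the test comments and their labels have already been merged into data_test):

# Keep only comments whose labels are known (-1 marks unknown labels)
data_test = data_test[(data_test[toxic_labels] != -1).all(axis=1)].reset_index(drop=True)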
plot_label_counts(data_test, all_labels, title="Counts of Each Label (Test Data)")
plt.show()
Fig. 2.3. Distribution of comment types (labels) in the test data. Each comment can have multiple labels.
2.4 Data Preprocessing: Training, Validation, and Test Sets
First, to evaluate the possibility of a stratified split, the frequencies of toxic comment label combinations in the training and validation data were counted, as sketched below. The results are presented in Table 2.1. The combinations “1,1,0,1,1,0” and “1,1,0,1,0,1” appeared only once, so these combinations were merged into one group, “etc”.
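A sketch of how these combination counts could be produced (data_train_val is assumed to hold the combined training and validation data; the variable names are illustrative):

# Encode each row's label combination as a string like "1,0,0,0,1,0"
label_combination = data_train_val[toxic_labels].astype(str).agg(",".join, axis=1)

# Count the combinations and merge those that appear only once into "etc"
combination_counts = label_combination.value_counts()
rare = combination_counts[combination_counts == 1].index
stratify_groups = label_combination.replace(dict.fromkeys(rare, "etc"))

print(stratify_groups.value_counts())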
The pre-processing consists of the following steps (a code sketch follows the list):
Split the data_train_val into training and validation sets (80% and 20%, respectively), stratified by the combinations of the different labels. The test set comes in separate files, so it is already separated from the main training-validation dataset.
Create the set column to indicate whether the row is in the training, validation or test set.
Bind the training, validation and test sets into a single dataset.
Clean the text data:
Fix technical artifacts: replace "" with " and remove " from the beginning and the end of the comment text.
Strip leading and trailing whitespace.
Save the pre-processed data to a file.
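A condensed sketch of the steps above (stratify_groups comes from the previous sketch; the comment_text column name, the set values, the random seed, and the output path are assumptions):

# 1. Stratified 80/20 split of the training-validation data
data_train, data_val = train_test_split(
    data_train_val, test_size=0.2, stratify=stratify_groups, random_state=42
)

# 2-3. Mark the set membership and bind everything into a single dataset
data_merged = pd.concat(
    [
        data_train.assign(set="train"),
        data_val.assign(set="validation"),
        data_test.assign(set="test"),
    ],
    ignore_index=True,
)

# 4. Clean the text: fix doubled quotes, strip surrounding quotes and whitespace
data_merged["comment_text"] = (
    data_merged["comment_text"]
    .str.replace('""', '"', regex=False)
    .str.strip('"')
    .str.strip()
)

# 5. Save the pre-processed data
data_merged.to_csv("data/preprocessed.csv", index=False)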
The sizes of the training, validation and test sets are presented in Fig. 2.4.
Table 2.1. The distribution of toxic comment type combinations in the training and validation data. Value 1 indicates presence and 0 indicates absence of comment types in this particular order: toxic, severe_toxic, obscene, threat, insult, identity_hate.
Code
set_counts = data_merged.set.value_counts(sort=False)
ax = set_counts.plot(kind="bar", rot=0, color="skyblue", ec="black", figsize=(10, 4))
for i, v in enumerate(set_counts):
    ax.text(
        i,
        v * 1.02,
        f" {v:,} ({v/len(data_merged) * 100:.1f}%)",
        color="black",
        va="bottom",
        ha="center",
        fontweight="bold",
    )
ax.set_xlabel("Set")
ax.set_ylabel("Count")
ax.set_ylim(top=ax.get_ylim()[1] * 1.08)
ax.set_title("The Size of Training, Validation, and Test Sets")
plt.show()
Fig. 2.4. Sizes of the training, validation, and test sets.
This section includes resource-intensive computations that were performed on Google Colab with an NVIDIA V100 GPU enabled.
3.1 Model: DistilBERT
Now, the pre-trained DistilBERT model will be loaded and the pre-classifier and classifier (the last two) layers will be prepared for training, while the remaining layers will be frozen.
Code
# Suppress unnecessary messages
logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)

# Load pretrained model/tokenizer
pretrained_weights = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)

# NOTE: AutoModelForSequenceClassification uses BCEWithLogitsLoss
# without weights so we will implement the weighted loss later
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_weights,
    num_labels=len(all_labels),
    problem_type="multi_label_classification",
)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Unfreeze last 2 layers for fine-tuning
for param in model.pre_classifier.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
Key training choices (a condensed sketch follows this list):
weighted binary cross-entropy loss is used to address class imbalance;
AdamW optimizer is used;
alongside loss, balanced accuracy, and F1 score, multi-label confusion matrices were logged to TensorBoard at each epoch for both training and validation sets.
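The full training module is not reproduced here; a condensed sketch of how these choices could look in a Lightning module (class, batch-key, and hyperparameter names are illustrative, the validation step is analogous and omitted, and pos_weight would be computed from the label frequencies):

class ToxicCommentsClassifier(L.LightningModule):
    """Illustrative sketch of the training choices listed above."""

    def __init__(self, model, pos_weight, lr=2e-5):
        super().__init__()
        self.model = model
        # Weighted binary cross-entropy loss to address class imbalance
        self.loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        self.lr = lr
        # Macro-averaged metrics, logged every epoch
        self.train_accuracy = Accuracy(
            task="multilabel", num_labels=len(all_labels), average="macro"
        )
        self.train_f1 = F1Score(
            task="multilabel", num_labels=len(all_labels), average="macro"
        )

    def training_step(self, batch, batch_idx):
        logits = self.model(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        ).logits
        loss = self.loss_fn(logits, batch["labels"].float())
        probs = torch.sigmoid(logits)
        self.train_accuracy(probs, batch["labels"].int())
        self.train_f1(probs, batch["labels"].int())
        self.log_dict(
            {
                "train_loss": loss,
                "train_accuracy": self.train_accuracy,
                "train_f1": self.train_f1,
            },
            on_epoch=True,
        )
        return loss

    def configure_optimizers(self):
        # AdamW on the unfrozen (pre-classifier and classifier) parameters only
        return torch.optim.AdamW(
            filter(lambda p: p.requires_grad, self.parameters()), lr=self.lr
        )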
2024-03-10 14:43:48
CPU times: user 2h 12min 56s, sys: 1min, total: 2h 13min 57s
Wall time: 2h 21min 26s
A single epoch took about 9-12 minutes to train and 2-3 minutes to validate. In total, 12 epochs took about 2 hours and 21 minutes.
4 Results
Performed locally
The code in the following section was run locally on a Windows 10 machine.
4.1 Re-Structure Logs
Currently, CSV and TensorBoard logs are stored in different directories, which contain subfolders for models and versions. It would be more convenient to have a single folder for each model with the different types of logs inside it.
%%bash
root_dir="logs"  # Without trailing slash

logger=("tensorboard_logs" "tensorboard_logs" "csv_logs")
subdir=("checkpoints/" "" "")
log_type=("checkpoints" "tensorboard_logs" "csv_logs")

# Read model names (based on the contents of csv_logs directory only)
readarray -t models < <(find "$root_dir/csv_logs" -mindepth 1 -maxdepth 1 -type d -printf "%f\n" | sort -u)

# Iterate over each model and find versions
for model in "${models[@]}"; do
    echo " "
    echo "=== Model: $model ==="
    versions=$(find "$root_dir/csv_logs/$model" -mindepth 1 -maxdepth 1 -type d -printf "%f\n")
    for version in $versions; do
        echo " "
        echo "--- $version ---"
        # Loop through directories
        for ((i=0; i<${#logger[@]}; i++)); do
            from="$root_dir/${logger[i]}/$model/$version/${subdir[i]}"
            to="$root_dir/$model/$version/${log_type[i]}/"
            echo "    $from -> $to"
            mkdir -p "$to"  # Create dir if it doesn't exist
            mv "$from"* "$to"
        done
    done
done

# Delete empty directories recursively
find "$root_dir" -type d -empty -delete
To monitor the changes in training and validation metrics in each epoch, TensorBoard was used (Fig. 4.2). The best validation performance was achieved in epoch 9 (see Table 4.2).
Fig. 4.2. Example: monitoring the training process metrics in TensorBoard.
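For reference, the dashboards can be opened from a notebook cell by pointing TensorBoard at the log directory (the logs/ path matches the structure created in Section 4.1):

# Load the TensorBoard notebook extension and start it on the log directory
%load_ext tensorboard
%tensorboard --logdir logs/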
4.3 Model Training Dynamics
This sub-section presents model performance at different training epochs. The training process took 12 epochs (Fig. 4.3), numbered from 0 to 11, with the best validation performance (weighted loss and F1 score) in epoch 9 (see Table 4.2). The results are graphically presented in Fig. 4.4, Fig. 4.5, and Fig. 4.6.
Fig. 4.3. Training length in each epoch (a screenshot from TensorBoard). The X-axis indicates the duration from the beginning of the training.
Code: Helpers to format results
barcolor = "#aaa"
palegreen = "#448f44"
Code
log_path_1 = "logs/DistilBERT/version_0/csv_logs/metrics.csv"
epoch_performance = read_metrics_log(log_path_1, "DistilBERT")
epoch_performance_wide = read_metrics_log(log_path_1, "DistilBERT", out_format="wide")
print("Dimensions in long format: ", epoch_performance.shape)
print("Dimensions in wide format: ", epoch_performance_wide.shape)
Dimensions in long format: (24, 6)
Dimensions in wide format: (12, 11)
A few rows of the imported CSV log (long format):
Code
display(epoch_performance.head(4))
Table 4.1. An example of imported performance scores in the long format (top 4 rows).
epoch  set    accuracy    f1  loss  model
    0  train      0.97  0.41  2.51  DistilBERT
    0  val        0.97  0.44  1.86  DistilBERT
    1  train      0.97  0.51  1.77  DistilBERT
    1  val        0.97  0.53  1.74  DistilBERT
In the following tables:
train_ – training set scores;
val_ – validation set scores;
_diff – the difference between validation and training set scores (positive numbers show that the validation score is higher than the training score);
Table 4.2. The results in five best-performing epochs. The best values of validation macro-averaged balanced accuracy, F1 score, and weighted binary cross-entropy loss are highlighted in green.
epoch  train_accuracy  train_f1  train_loss  val_accuracy  val_f1  val_loss  accuracy_diff  f1_diff  loss_diff  model
    7           0.976     0.609       1.510         0.975   0.580     1.601         -0.001   -0.029      0.092  DistilBERT
    8           0.976     0.607       1.487         0.976   0.548     1.610         -0.001   -0.058      0.123  DistilBERT
    9           0.976     0.616       1.468         0.976   0.585     1.583         -0.001   -0.031      0.115  DistilBERT
   10           0.977     0.621       1.450         0.976   0.576     1.600         -0.001   -0.045      0.150  DistilBERT
   11           0.977     0.626       1.431         0.976   0.578     1.596         -0.001   -0.048      0.165  DistilBERT
Code
n_epochs = epoch_performance_wide.epoch.max()

# Create subplots
fig, axes = plt.subplots(
    nrows=2, ncols=1, figsize=(8, 5), gridspec_kw={"height_ratios": [9, 5]}
)

# Plot the first line plot
sns.lineplot(
    data=epoch_performance,
    x="epoch",
    y="loss",
    hue="model",
    style="set",
    markers=True,
    ax=axes[0],
)
# axes[0].set_yscale('log')
axes[0].set_title("Weighted Binary Cross-Entropy Loss")
axes[0].set_xlim(-0.25, n_epochs + 0.5)
axes[0].set_ylim(0.0, 2.6)
axes[0].set_xlabel("")  # Remove x-axis title
axes[0].set_ylabel("Loss")

# Plot the second line plot
sns.lineplot(
    data=epoch_performance_wide,
    x="epoch",
    y="loss_diff",
    hue="model",
    markers=True,
    ax=axes[1],
)
axes[1].axhline(y=0, color="grey", linestyle="--")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Difference in Loss\n(Validation – Training)")
axes[1].set_xlim(-0.25, n_epochs + 0.5)
axes[1].set_ylim(-1.0, 0.5)

# Common
# Align y-axis labels
plt.tight_layout()
plt.show()
Fig. 4.4. The change in weighted binary cross-entropy loss values by epoch.
Code
n_epochs = epoch_performance_wide.epoch.max()

# Create subplots
fig, axes = plt.subplots(
    nrows=2, ncols=1, figsize=(8, 5), gridspec_kw={"height_ratios": [9, 5]}
)

# Plot the first line plot
sns.lineplot(
    data=epoch_performance,
    x="epoch",
    y="accuracy",
    hue="model",
    style="set",
    markers=True,
    palette=["darkred"],
    ax=axes[0],
)
axes[0].set_title("Balanced Accuracy")
axes[0].set_xlim(-0.25, n_epochs + 0.5)
axes[0].set_ylim(0.96, 1)
axes[0].set_xlabel("")  # Remove x-axis title
axes[0].set_ylabel("Accuracy")

# Plot the second line plot
sns.lineplot(
    data=epoch_performance_wide,
    x="epoch",
    y="accuracy_diff",
    hue="model",
    markers=True,
    palette=["darkred"],
    ax=axes[1],
)
axes[1].axhline(y=0, color="grey", linestyle="--")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Difference in Accuracy\n(Validation – Training)")
axes[1].set_xlim(-0.25, n_epochs + 0.5)
axes[1].set_ylim(-0.005, 0.010)

# Common
plt.tight_layout()
plt.show()
Fig. 4.5. The change in macro averaged balanced accuracy values by epoch.
Code
n_epochs = epoch_performance_wide.epoch.max()

# Create subplots
fig, axes = plt.subplots(
    nrows=2, ncols=1, figsize=(8, 5), gridspec_kw={"height_ratios": [9, 5]}
)

# Plot the first line plot
sns.lineplot(
    data=epoch_performance,
    x="epoch",
    y="f1",
    hue="model",
    style="set",
    markers=True,
    palette=["#2ca02c"],
    ax=axes[0],
)
axes[0].set_title("F1 Score")
axes[0].set_xlim(-0.25, n_epochs + 0.5)
axes[0].set_ylim(0, 1)
axes[0].set_xlabel("")  # Remove x-axis title
axes[0].set_ylabel("F1")

# Plot the second line plot
sns.lineplot(
    data=epoch_performance_wide,
    x="epoch",
    y="f1_diff",
    hue="model",
    markers=True,
    palette=["#2ca02c"],
    ax=axes[1],
)
axes[1].axhline(y=0, color="grey", linestyle="--")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Difference in F1\n(Validation – Training)")
axes[1].set_xlim(-0.25, n_epochs + 0.5)
axes[1].set_ylim(-0.11, 0.15)

# Common
plt.tight_layout()
plt.show()
Fig. 4.6. The change in macro averaged F1 scores by epoch.
Figures 4.7–4.11 show confusion matrices for multi-label classification in the best-performing epochs. The images are exported from the TensorBoard log. Each matrix represents a single label, which is written above the matrix. Value 0 indicates absence and 1 indicates presence of the label.
Fig. 4.7. Confusion matrices for multi-label classification in epoch 7 (validation). For details, refer to the description in the text.
Fig. 4.8. Confusion matrices for multi-label classification in epoch 8 (validation). For details, refer to the description in the text.
Fig. 4.9. Confusion matrices for multi-label classification in epoch 9 (validation), which indicates the best performance. For details, refer to the description in the text.
Fig. 4.10. Confusion matrices for multi-label classification in epoch 10 (validation). For details, refer to the description in the text.
Fig. 4.11. Confusion matrices for multi-label classification in epoch 11 (validation). For details, refer to the description in the text.
4.4 Best Model Evaluation
Performed on Colab
This section includes resource-intensive computations that were performed on Google Colab with an NVIDIA A100 GPU enabled.
In this section, the final model was evaluated on the test set. Compared to the validation set, the test set showed slightly lower performance (validation → test): weighted loss 1.583 → 1.977, macro-averaged balanced accuracy 0.976 → 0.964, and macro-averaged F1 score 0.584 → 0.551. Smaller groups showed lower individual F1 scores (e.g., “threat”: n=211, F1=0.367) compared to larger groups (e.g., “non-toxic”: n=57,735, F1=0.956). Inference speed on the powerful NVIDIA A100 GPU was approximately 400 comments per second. More details are presented below.
Information on GPU used in this section:
!nvidia-smi
Mon Mar 11 18:32:27 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 30C P0 40W / 400W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The results are presented below.
Code: Load final model (Epoch 9) and create its trainer object
with suppress_certain_logs_and_warnings():
    test_results = trainer_gpu.predict(model=bert, datamodule=data_module)
In Colab with an NVIDIA A100 GPU, inference on 63,978 comments in 3,999 batches took about 2 minutes and 30 seconds (approximately 400 comments per second).
Fig. 4.12. Inference speed on NVIDIA A100 GPU.
Code: Collect data from different batches
# Collect data from different batches
test_preds = []
test_targets = []
for batch_result in test_results:
    test_preds.extend(batch_result["pred"].tolist())
    test_targets.extend(batch_result["target"].tolist())

# As NumPy arrays
test_preds_np = np.array(test_preds)
test_targets_np = np.array(test_targets)

# As tensors
test_preds_pt = torch.tensor(test_preds_np)
test_targets_pt = torch.tensor(test_targets_np).int()
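From these arrays, the test-set scores and the per-label breakdown reported above could be computed roughly as follows (a sketch; it assumes test_preds_pt holds predicted probabilities and uses the default 0.5 decision threshold):

num_labels = len(all_labels)

# Macro-averaged accuracy and F1 score with torchmetrics
test_accuracy = Accuracy(task="multilabel", num_labels=num_labels, average="macro")
test_f1 = F1Score(task="multilabel", num_labels=num_labels, average="macro")
print("Accuracy (macro):", test_accuracy(test_preds_pt, test_targets_pt))
print("F1 score (macro):", test_f1(test_preds_pt, test_targets_pt))

# One 2x2 confusion matrix per label
test_cm = ConfusionMatrix(task="multilabel", num_labels=num_labels)
print(test_cm(test_preds_pt, test_targets_pt).shape)  # (num_labels, 2, 2)

# Per-label precision, recall, and F1 with scikit-learn
print(
    classification_report(
        test_targets_np, (test_preds_np > 0.5).astype(int), target_names=all_labels
    )
)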
Fig. 4.15. Predicted labels normalized confusion matrix for the test set.
Final Remarks
Environments:
The project was performed on 2 different machines: local and Colab. It would be less confusing and more reproducible to perform the whole project on a single machine.
The setup on a local machine and in Colab differed a bit (e.g., Python and pandas versions). This time I created the environment on the local machine first and only later found out about the restrictions in Colab. Nevertheless, the differences were not essential.
Data preparation and EDA:
The stop words were not removed from the comments. It might be beneficial to do so.
The requirements of this project clearly stated that the analyst should not concentrate on EDA and should pay attention to the modeling part. However, a more extensive EDA resulting in pre-processing may lead to better model performance.
Look into duplicates or compare the groups after some pre-processing (e.g., removing special symbols, emojis, URLs, HTML and similar text, using lowercase only, etc.).
Look into the language of the comments and remove non-English ones as the model was pre-trained on English text only.
Compare labels (groups):
by comment length;
the most common words;
the most common n-grams (bigrams, trigrams, etc.);
the most common tokens;
the most common characters.
Modeling:
Hyperparameter tuning was not performed in this project but it might be beneficial to do so.
Several different models/model architectures (e.g., BERT, RoBERTa, etc.) could be tested to find the best one for this task.
Scheduler and warm-up steps could be used to improve the training process. Unfortunately, I got some warnings from the Lightning side, so I decided to skip this part this time (an illustrative sketch follows this list).
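For future reference, a minimal sketch of how warm-up could be wired into configure_optimizers of the Lightning module sketched earlier, using transformers' get_linear_schedule_with_warmup (the learning rate and step counts are illustrative; this was not used in the final model):

from transformers import get_linear_schedule_with_warmup

# Inside the Lightning module, configure_optimizers could return both the
# optimizer and a per-step linear warm-up scheduler:
def configure_optimizers(self):
    optimizer = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, self.parameters()), lr=2e-5
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=500, num_training_steps=10_000
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
    }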
Presenting the results:
Confusion matrices from TensorBoard used the default settings and were not well formatted. For this reason, the confusion matrices for the test set were improved.