emotion_clf_pipeline.data

Classes

DataPreparation(output_columns, tokenizer[, ...])

A class to handle data preparation for emotion classification tasks.

DatasetLoader()

A class to handle loading and preprocessing of emotion classification datasets.

EmotionDataset(*args, **kwargs)

Custom Dataset for emotion classification.

FeatureExtractor([feature_config, lexicon_path])

A comprehensive feature extraction class for text analysis.

class emotion_clf_pipeline.data.DataPreparation(output_columns, tokenizer, max_length=128, batch_size=16, feature_config=None, encoders_save_dir=None, encoders_load_dir=None)

Bases: object

A class to handle data preparation for emotion classification tasks.

This class handles:
  • Label encoding for target variables

  • Dataset creation

  • Dataloader setup

Parameters:
  • output_columns (list) – List of output column names to encode

  • tokenizer – Pretrained tokenizer used to tokenize the text

  • max_length (int) – Maximum sequence length for tokenization

  • batch_size (int) – Batch size for dataloaders

  • feature_config (dict, optional) – Configuration for feature extraction

__init__(output_columns, tokenizer, max_length=128, batch_size=16, feature_config=None, encoders_save_dir=None, encoders_load_dir=None)

get_num_classes()

Get the number of classes for each output column.
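The label-encoding step and the per-column class count can be pictured with a minimal pure-Python sketch. It is not the class's implementation: the helper names `fit_label_encoders` and `count_classes`, the sample rows, and the dictionary return shape are all assumptions made for illustration.

```python
from typing import Dict, List


def fit_label_encoders(rows: List[dict], output_columns: List[str]) -> Dict[str, Dict[str, int]]:
    """Build one label -> integer id mapping per output column (hypothetical helper)."""
    encoders = {}
    for column in output_columns:
        # Sort the distinct labels so the integer ids are deterministic.
        classes = sorted({row[column] for row in rows})
        encoders[column] = {label: idx for idx, label in enumerate(classes)}
    return encoders


def count_classes(encoders: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return the number of distinct classes for each encoded column."""
    return {column: len(mapping) for column, mapping in encoders.items()}


# Made-up sample data; the column names are assumptions.
rows = [
    {"text": "I love this", "emotion": "happiness", "intensity": "strong"},
    {"text": "Not great", "emotion": "sadness", "intensity": "mild"},
    {"text": "So good!", "emotion": "happiness", "intensity": "mild"},
]
encoders = fit_label_encoders(rows, ["emotion", "intensity"])
num_classes = count_classes(encoders)
```

A per-column count like this is typically what sizes each classification head in a multi-task model.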

prepare_data(train_df, test_df=None, validation_split=0.2, apply_augmentation=False, balance_strategy='equal', samples_per_class=None, augmentation_ratio=2)

Prepare data for training emotion classification models.

Parameters:
  • train_df (pd.DataFrame) – Training dataframe

  • test_df (pd.DataFrame, optional) – Test dataframe. Defaults to None.

  • validation_split (float, optional) – Fraction of training data to use for validation. Defaults to 0.2.

  • apply_augmentation (bool, optional) – Whether to apply data augmentation. Defaults to False.

  • balance_strategy (str, optional) – Strategy for balancing if augmentation is applied. Options: 'equal', 'majority', 'target'. Defaults to 'equal'.

  • samples_per_class (int, optional) – Number of samples per class for balancing. Defaults to None.

  • augmentation_ratio (int, optional) – Maximum ratio of augmented to original samples. Defaults to 2.

Returns:

(train_dataset, val_dataset, test_dataset, train_dataloader, val_dataloader, test_dataloader, class_weights_tensor)

Return type:

tuple
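To make `validation_split` and the returned class weights more concrete, here is a small standard-library sketch. It assumes a shuffled hold-out split and inverse-frequency weighting; the source does not state which splitting or weighting scheme `prepare_data` actually uses, and the function names and sample labels are hypothetical.

```python
import random
from collections import Counter


def split_train_validation(rows, validation_split=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    return shuffled[n_val:], shuffled[:n_val]


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count).

    This is one common weighting scheme, assumed here for illustration.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Made-up, imbalanced label set.
labels = ["happiness"] * 6 + ["sadness"] * 3 + ["anger"] * 1
rows = [{"text": f"sample {i}", "emotion": label} for i, label in enumerate(labels)]

train_rows, val_rows = split_train_validation(rows, validation_split=0.2)
weights = inverse_frequency_weights([row["emotion"] for row in rows])
```

Under this scheme rare classes receive larger weights, which is the usual motivation for passing class weights to a loss function.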

class emotion_clf_pipeline.data.DatasetLoader

Bases: object

A class to handle loading and preprocessing of emotion classification datasets.

This class handles:
  • Loading training and test data from CSV files

  • Cleaning and preprocessing the data

  • Mapping emotions to standardized categories

  • Visualizing data distributions

emotion_mapping

Dictionary mapping sub-emotions to standardized emotions

Type:

dict

train_df

Processed training data

Type:

pd.DataFrame

test_df

Processed test data

Type:

pd.DataFrame
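The role of the `emotion_mapping` attribute can be illustrated with a short sketch. The mapping entries, the `standardize_emotion` helper, and the `"neutral"` fallback are hypothetical; the real dictionary's contents are not shown in this reference.

```python
# Hypothetical mapping from fine-grained sub-emotions to standardized emotions.
emotion_mapping = {
    "joy": "happiness",
    "amusement": "happiness",
    "grief": "sadness",
    "annoyance": "anger",
}


def standardize_emotion(sub_emotion, mapping, default="neutral"):
    """Look up a sub-emotion after normalizing case and whitespace.

    Unknown sub-emotions fall back to `default` (an assumption of this sketch).
    """
    return mapping.get(sub_emotion.strip().lower(), default)


standardized = [standardize_emotion(s, emotion_mapping) for s in ["Joy", " grief ", "confusion"]]
```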

__init__()

load_test_data(test_file='./../../data/test_data-0001.csv')

Load and preprocess test data from a CSV file.

Parameters:

test_file (str) – Path to the test data CSV file

Returns:

Processed test data

Return type:

pd.DataFrame

load_training_data(data_dir='./../../data/raw/all groups')

Load and preprocess training data from multiple CSV files.

Parameters:

data_dir (str) – Directory containing training data CSV files

Returns:

Processed training data

Return type:

pd.DataFrame
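Loading training data from several CSV files in one directory might look roughly like the sketch below. It uses only the standard library and returns a list of dictionaries rather than a `pd.DataFrame`; the `text` and `emotion` column names, the file names, and the blank-text cleaning rule are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_directory(data_dir):
    """Read every *.csv file in data_dir and concatenate the rows."""
    rows = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Basic cleaning: strip whitespace and skip rows without text.
                text = (row.get("text") or "").strip()
                if text:
                    row["text"] = text
                    rows.append(row)
    return rows


# Build two small throwaway files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "group_a.csv").write_text(
        "text,emotion\nI am thrilled,happiness\n ,sadness\n", encoding="utf-8"
    )
    Path(tmp, "group_b.csv").write_text(
        "text,emotion\nThis is awful,anger\n", encoding="utf-8"
    )
    training_rows = load_csv_directory(tmp)
```

Sorting the file paths keeps the row order reproducible across runs and platforms.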

plot_distributions()

Plot distributions of emotions, sub-emotions, and intensities for both training and test sets.

class emotion_clf_pipeline.data.EmotionDataset(*args, **kwargs)

Bases: Dataset

Custom Dataset for emotion classification.

__init__(texts, tokenizer, features, labels=None, feature_extractor=None, max_length=128, output_tasks=None)

Initialize the dataset.

Parameters:
  • texts (list) – List of text samples

  • tokenizer – BERT tokenizer

  • features (np.ndarray) – Pre-extracted features

  • labels (list, optional) – List of label tuples (emotion, sub_emotion, intensity). None for prediction.

  • feature_extractor (FeatureExtractor, optional) – Feature extractor instance. Not strictly needed if features are pre-computed.

  • max_length (int) – Maximum sequence length for BERT

  • output_tasks (list, optional) – List of tasks to output. Used only if labels are provided.
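The parameters above suggest a map-style dataset that pairs tokenized text with pre-extracted features and optional labels. The following is a dependency-free sketch of that idea, not the actual class: `SimpleEmotionDataset`, `toy_tokenizer`, the returned dictionary keys, and the padding id `0` are all assumptions, and the real class subclasses `Dataset` and uses a BERT tokenizer.

```python
class SimpleEmotionDataset:
    """Minimal map-style dataset: implements __len__ and __getitem__."""

    def __init__(self, texts, tokenizer, features, labels=None, max_length=128):
        self.texts = texts
        self.tokenizer = tokenizer
        self.features = features
        self.labels = labels  # None means prediction mode
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Truncate to max_length, then pad with 0 up to max_length.
        token_ids = self.tokenizer(self.texts[idx])[: self.max_length]
        padding = [0] * (self.max_length - len(token_ids))
        item = {
            "input_ids": token_ids + padding,
            "attention_mask": [1] * len(token_ids) + [0] * len(padding),
            "features": self.features[idx],
        }
        if self.labels is not None:
            emotion, sub_emotion, intensity = self.labels[idx]
            item["labels"] = {
                "emotion": emotion,
                "sub_emotion": sub_emotion,
                "intensity": intensity,
            }
        return item


def toy_tokenizer(text):
    """Hypothetical stand-in for a real tokenizer: one id per word (its length)."""
    return [len(word) for word in text.split()]


texts = ["I feel great", "This is terrible news today"]
features = [[0.1, 0.9], [0.8, 0.2]]  # made-up pre-extracted feature vectors
labels = [(0, 3, 2), (1, 5, 1)]      # made-up encoded (emotion, sub_emotion, intensity)

train_set = SimpleEmotionDataset(texts, toy_tokenizer, features, labels, max_length=4)
predict_set = SimpleEmotionDataset(texts, toy_tokenizer, features, max_length=4)
first_item = train_set[0]
second_item = train_set[1]
```

Omitting `labels` yields items without a `"labels"` entry, which mirrors the "None for prediction" behavior described above.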

emotion_clf_pipeline.data.log_class_distributions(df, output_tasks, df_name)

Logs the class distribution for specified tasks in a dataframe.

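Parameters:
  • df – Dataframe whose class distributions are logged

  • output_tasks – Tasks to report class distributions for

  • df_name – Name of the dataframe, used to identify it in the log output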
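A rough standard-library sketch of per-task class-distribution logging follows. It takes a list of dictionaries instead of a dataframe, and the logger name, message format, and returned dictionary are assumptions rather than the behavior of `log_class_distributions` itself.

```python
import logging
from collections import Counter

logger = logging.getLogger("emotion_data_sketch")  # hypothetical logger name


def class_distribution(rows, task):
    """Count how often each class of `task` occurs in `rows`."""
    return Counter(row[task] for row in rows)


def log_class_distributions_sketch(rows, output_tasks, df_name):
    """Log per-class counts and percentages for each task, and return the counts."""
    distributions = {}
    for task in output_tasks:
        counts = class_distribution(rows, task)
        total = sum(counts.values())
        logger.info("Class distribution for '%s' in %s:", task, df_name)
        for label, count in counts.most_common():
            logger.info("  %s: %d (%.1f%%)", label, count, 100 * count / total)
        distributions[task] = counts
    return distributions


logging.basicConfig(level=logging.INFO)
rows = [
    {"emotion": "happiness", "intensity": "mild"},
    {"emotion": "happiness", "intensity": "strong"},
    {"emotion": "anger", "intensity": "mild"},
]
distributions = log_class_distributions_sketch(rows, ["emotion", "intensity"], "train_df")
```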

emotion_clf_pipeline.data.main()

Main function to run the data processing pipeline.

emotion_clf_pipeline.data.parse_args()

Parse command-line arguments.