emotion_clf_pipeline.data
Classes
DataPreparation | A class to handle data preparation for emotion classification tasks.
DatasetLoader | A class to handle loading and preprocessing of emotion classification datasets.
EmotionDataset | Custom Dataset for emotion classification.
FeatureExtractor | A comprehensive feature extraction class for text analysis.
- class emotion_clf_pipeline.data.DataPreparation(output_columns, tokenizer, max_length=128, batch_size=16, feature_config=None, encoders_save_dir=None, encoders_load_dir=None)[source]
Bases:
object
A class to handle data preparation for emotion classification tasks.
This class handles:
- Label encoding for target variables
- Dataset creation
- Dataloader setup
- Parameters:
output_columns (list) – List of output column names to encode
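tokenizer – Tokenizer used for tokenization. The signature takes tokenizer, although the docstring still documents it as model_name (str), the name of the pretrained model to use for tokenization.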
max_length (int) – Maximum sequence length for tokenization
batch_size (int) – Batch size for dataloaders
feature_config (dict, optional) – Configuration for feature extraction
- __init__(output_columns, tokenizer, max_length=128, batch_size=16, feature_config=None, encoders_save_dir=None, encoders_load_dir=None)[source]
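To make "label encoding for target variables" concrete, the sketch below fits one encoder per output column and maps string labels to integer ids. It is a minimal illustration in plain Python, not the class's actual implementation: SimpleLabelEncoder, fit_encoders, and the sample rows are hypothetical names and data introduced here, and the sorted-order id assignment is an assumption.

```python
from typing import Dict, List


class SimpleLabelEncoder:
    """Hypothetical helper: maps string labels to integer ids in sorted order."""

    def __init__(self) -> None:
        self.classes_: List[str] = []
        self._index: Dict[str, int] = {}

    def fit(self, labels: List[str]) -> "SimpleLabelEncoder":
        # Sorting makes the label-to-id mapping deterministic.
        self.classes_ = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels: List[str]) -> List[int]:
        # Raises KeyError for a label that was not seen during fit.
        return [self._index[label] for label in labels]


def fit_encoders(rows, output_columns):
    """Fit one encoder per output column over a list of dict rows."""
    return {
        col: SimpleLabelEncoder().fit([row[col] for row in rows])
        for col in output_columns
    }


# Illustrative data only; the column names are assumptions.
rows = [
    {"text": "great day", "emotion": "happiness", "intensity": "mild"},
    {"text": "so angry", "emotion": "anger", "intensity": "strong"},
    {"text": "lovely", "emotion": "happiness", "intensity": "strong"},
]
encoders = fit_encoders(rows, ["emotion", "intensity"])
encoded_emotions = encoders["emotion"].transform([r["emotion"] for r in rows])
```

Keeping one encoder per column lets each task (for example emotion and intensity) have its own independent id space.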
- prepare_data(train_df, test_df=None, validation_split=0.2, apply_augmentation=False, balance_strategy='equal', samples_per_class=None, augmentation_ratio=2)[source]
Prepare data for training emotion classification models.
- Parameters:
train_df (pd.DataFrame) – Training dataframe
test_df (pd.DataFrame, optional) – Test dataframe. Defaults to None.
validation_split (float, optional) – Fraction of training data to use for validation. Defaults to 0.2.
apply_augmentation (bool, optional) – Whether to apply data augmentation. Defaults to False.
balance_strategy (str, optional) – Strategy for balancing if augmentation is applied. Options: 'equal', 'majority', 'target'. Defaults to 'equal'.
samples_per_class (int, optional) – Number of samples per class for balancing. Defaults to None.
augmentation_ratio (int, optional) – Maximum ratio of augmented to original samples. Defaults to 2.
- Returns:
(train_dataset, val_dataset, test_dataset, train_dataloader, val_dataloader, test_dataloader, class_weights_tensor)
- Return type:
tuple
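The validation_split parameter and the returned class_weights_tensor can be illustrated with a small plain-Python sketch. It makes two assumptions that this page does not confirm: that the split is a simple shuffled hold-out, and that class weights follow the common "balanced" inverse-frequency formula n_samples / (n_classes * count). The function names split_train_val and balanced_class_weights are hypothetical.

```python
import random
from collections import Counter


def split_train_val(rows, validation_split=0.2, seed=42):
    """Shuffle a copy of rows and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    return shuffled[n_val:], shuffled[:n_val]


def balanced_class_weights(labels):
    """Return weight[c] = n_samples / (n_classes * count[c]) per class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Illustrative, imbalanced label list.
labels = ["happiness"] * 6 + ["anger"] * 3 + ["fear"] * 1
train, val = split_train_val(labels, validation_split=0.2)
weights = balanced_class_weights(labels)
```

Under this scheme rarer classes receive larger weights, which is the usual reason for passing such weights to a loss function.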
- class emotion_clf_pipeline.data.DatasetLoader[source]
Bases:
object
A class to handle loading and preprocessing of emotion classification datasets.
This class handles:
- Loading training and test data from CSV files
- Cleaning and preprocessing the data
- Mapping emotions to standardized categories
- Visualizing data distributions
- train_df
Processed training data
- Type:
pd.DataFrame
- test_df
Processed test data
- Type:
pd.DataFrame
- load_test_data(test_file='./../../data/test_data-0001.csv')[source]
Load and preprocess test data from a CSV file.
- Parameters:
test_file (str) – Path to the test data CSV file
- Returns:
Processed test data
- Return type:
pd.DataFrame
- class emotion_clf_pipeline.data.EmotionDataset(*args, **kwargs)[source]
Bases:
Dataset
Custom Dataset for emotion classification.
- __init__(texts, tokenizer, features, labels=None, feature_extractor=None, max_length=128, output_tasks=None)[source]
Initialize the dataset.
- Parameters:
texts (list) – List of text samples
tokenizer – BERT tokenizer
features (np.ndarray) – Pre-extracted features
labels (list, optional) – List of label tuples (emotion, sub_emotion, intensity). None for prediction.
feature_extractor (FeatureExtractor, optional) – Feature extractor instance. Not strictly needed if features are pre-computed.
max_length (int) – Maximum sequence length for BERT
output_tasks (list, optional) – List of tasks to output. Used only if labels are provided.
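The parameters above describe a map-style dataset: each index yields tokenized text, its pre-extracted features, and, when labels are given, the label tuple. The sketch below shows that shape in plain Python without PyTorch. ToyEmotionDataset, toy_tokenize, the dictionary keys, and the padding id 0 are all assumptions made for illustration; the real class subclasses Dataset and uses a BERT tokenizer.

```python
class ToyEmotionDataset:
    """Map-style dataset sketch: implements __len__ and __getitem__ only."""

    def __init__(self, texts, tokenize, features, labels=None, max_length=8):
        self.texts = texts
        self.tokenize = tokenize  # hypothetical callable: str -> list[int]
        self.features = features
        self.labels = labels      # None means prediction mode
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Truncate to max_length, then right-pad with 0 up to max_length.
        ids = self.tokenize(self.texts[idx])[: self.max_length]
        pad = self.max_length - len(ids)
        item = {
            "input_ids": ids + [0] * pad,
            "attention_mask": [1] * len(ids) + [0] * pad,
            "features": self.features[idx],
        }
        if self.labels is not None:
            item["labels"] = self.labels[idx]
        return item


def toy_tokenize(text):
    """Stand-in tokenizer: one id per whitespace token (its length)."""
    return [len(word) for word in text.split()]


train_ds = ToyEmotionDataset(
    ["I am so happy", "no"],
    toy_tokenize,
    features=[[0.1, 0.2], [0.3, 0.4]],
    labels=[(1, 4, 2), (0, 1, 0)],
    max_length=6,
)
predict_ds = ToyEmotionDataset(
    ["what a day"], toy_tokenize, features=[[0.5, 0.6]], max_length=6
)
```

Omitting labels, as in predict_ds, leaves the "labels" key out of each item, which mirrors the documented "None for prediction" behavior.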
- emotion_clf_pipeline.data.log_class_distributions(df, output_tasks, df_name)[source]
Logs the class distribution for specified tasks in a dataframe.
- Parameters:
df (pandas.DataFrame)
df_name (str)
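As a rough picture of what logging class distributions per task involves, the sketch below counts labels for each task and logs the counts with their shares. It works on a list of dict rows instead of a pandas DataFrame so that it stays dependency-free. The function name log_distributions, the logger name, and the message format are assumptions, not the library's output.

```python
import logging
from collections import Counter

logger = logging.getLogger("class_distribution_sketch")


def log_distributions(rows, output_tasks, name):
    """Log per-class counts and shares for each task; return the counts."""
    summary = {}
    for task in output_tasks:
        counts = Counter(row[task] for row in rows)
        total = sum(counts.values())
        summary[task] = dict(counts)
        for label, count in counts.most_common():
            logger.info(
                "%s | %s | %s: %d (%.1f%%)",
                name, task, label, count, 100.0 * count / total,
            )
    return summary


# Illustrative rows; the "emotion" column name is an assumption.
rows = [
    {"emotion": "happiness"},
    {"emotion": "anger"},
    {"emotion": "happiness"},
]
summary = log_distributions(rows, ["emotion"], "train")
```

Returning the counts alongside logging them makes the helper easy to check in tests; whether the documented function returns anything is not stated here.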