Data preparation for the Training Pipeline

This module contains functions for data ingestion and preprocessing during the training phase.

Data Ingestion

load_data: Load CSV or JSON file and return a DataFrame with specified text and emotion columns.

from emotion_detective.data.training.data_ingestion import load_data

# Example usage
df = load_data('data.csv', 'text_column', 'emotion_column')

emotion_detective.data.training.data_ingestion.load_data(file_path: str, text_column: str, emotion_column: str) → DataFrame

Load CSV or JSON file and return a DataFrame with specified text and emotion columns, renamed to ‘text’ and ‘label’ respectively.

Parameters:

file_path (str) – Path to the CSV or JSON file.
text_column (str) – Name of the column containing text data.
emotion_column (str) – Name of the column containing emotion data.

Returns:

DataFrame with text and emotion columns renamed.

Return type:

pd.DataFrame

Author: Martin Vladimirov
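
To make the documented behaviour concrete, here is a minimal sketch of what a loader with this contract could look like. It is not the library's implementation: the name load_data_sketch and the choice to detect the format from the file extension are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def load_data_sketch(file_path: str, text_column: str, emotion_column: str) -> pd.DataFrame:
    """Illustrative stand-in for load_data (hypothetical, not the library code)."""
    # Assumption: the file format is inferred from the extension.
    suffix = Path(file_path).suffix.lower()
    if suffix == ".csv":
        df = pd.read_csv(file_path)
    elif suffix == ".json":
        df = pd.read_json(file_path)
    else:
        raise ValueError(f"Unsupported file format: {suffix!r}")

    # Selecting a column that does not exist raises KeyError.
    subset = df[[text_column, emotion_column]]
    # Rename to the fixed names described above: 'text' and 'label'.
    return subset.rename(columns={text_column: "text", emotion_column: "label"})
```

Whatever the input column names are, callers of such a function always receive a DataFrame with exactly the two columns 'text' and 'label'.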

Troubleshooting

Problem: FileNotFoundError when the input file is not found. Solution: Ensure the input file path is correct and the file exists.
try:
    df = load_data('data.csv', 'text_column', 'emotion_column')
except FileNotFoundError:
    print("The specified file was not found. Please check the file path.")
Problem: ValueError when the file format is not supported. Solution: Ensure the file is in CSV or JSON format.
try:
    # 'data.txt' stands in for a file that is neither CSV nor JSON
    df = load_data('data.txt', 'text_column', 'emotion_column')
except ValueError:
    print("Unsupported file format. Please provide a CSV or JSON file.")
Problem: KeyError when the specified columns are not found in the file. Solution: Verify that the file contains the specified text and emotion columns.
try:
    df = load_data('data.csv', 'text_column', 'emotion_column')
except KeyError:
    print("The specified columns were not found in the file. Please check the column names.")
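
The three cases above can also be handled in a single try block. The sketch below assumes a hypothetical helper, load_or_explain, that takes the loader as an argument (in practice load_data would be passed in) and returns None when loading fails.

```python
from typing import Any, Callable, Optional


def load_or_explain(
    loader: Callable[..., Any],
    file_path: str,
    text_column: str,
    emotion_column: str,
) -> Optional[Any]:
    """Call loader and print a hint for each documented failure mode.

    Hypothetical helper for illustration; returns None on failure.
    """
    try:
        return loader(file_path, text_column, emotion_column)
    except FileNotFoundError:
        print(f"File not found: {file_path}. Please check the file path.")
    except ValueError:
        print("Unsupported file format. Please provide a CSV or JSON file.")
    except KeyError:
        print(f"Columns {text_column!r} or {emotion_column!r} were not found in the file.")
    return None
```

Usage would look like df = load_or_explain(load_data, 'data.csv', 'text_column', 'emotion_column'), followed by a check for None before continuing.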

Data Preprocessing

preprocess_text: Preprocesses text data in a specified DataFrame column.
balancing_multiple_classes: Balance the classes in a DataFrame containing multiple classes.
spell_check_and_correct: Perform spell checking and correction on the input column of a DataFrame.

from emotion_detective.data.training.data_preprocessing import preprocess_text, balancing_multiple_classes, spell_check_and_correct

# Example usage
preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column')
balanced_df = balancing_multiple_classes(preprocessed_df, 'emotion_column')
corrected_df = spell_check_and_correct(df, 'text_column')

emotion_detective.data.training.data_preprocessing.balancing_multiple_classes(df: DataFrame, emotion_column: str = 'label') → DataFrame

Balance the classes in a DataFrame containing multiple classes.

Parameters:

df (pd.DataFrame) – The DataFrame containing the data.
emotion_column (str) – The name of the column containing class labels.

Returns:

A balanced DataFrame with an equal number of samples for each class.

Return type:

pd.DataFrame

Author: Amy Suneeth
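
The documentation does not state how the classes are balanced. One common way to reach an equal number of samples per class is to downsample every class to the size of the smallest one. The sketch below shows that approach; the function name balance_by_downsampling and the random_state parameter are assumptions, and the library may use a different strategy (for example oversampling).

```python
import pandas as pd


def balance_by_downsampling(
    df: pd.DataFrame, emotion_column: str = "label", random_state: int = 0
) -> pd.DataFrame:
    """Downsample each class to the size of the smallest class (illustrative)."""
    # Size of the rarest class determines how many rows each class keeps.
    smallest = df[emotion_column].value_counts().min()
    balanced = df.groupby(emotion_column, group_keys=False).sample(
        n=smallest, random_state=random_state
    )
    return balanced.reset_index(drop=True)
```

Downsampling discards data from the larger classes, so it is only a reasonable choice when the smallest class still holds enough examples.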

emotion_detective.data.training.data_preprocessing.preprocess_text(df: DataFrame, text_column: str = 'text', emotion_column: str = 'label', mapping_filename: str | None = None) → DataFrame

Preprocess text data in a specified DataFrame column by:

1. Lowercasing all text.
2. Mapping emotion labels from strings to integers and storing them in a new column.

Parameters:

df (pd.DataFrame) – Input DataFrame containing text data and emotion labels.
text_column (str) – Name of the column in the DataFrame containing text data.
emotion_column (str) – Name of the column in the DataFrame containing emotion labels.
mapping_filename (str | None) – Optional filename to save the emotion mapping.

Returns:

DataFrame with lowercased text and integer emotion labels.

Return type:

pd.DataFrame

Author: Martin Vladimirov
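
The two preprocessing steps can be sketched with plain pandas as follows. This is an illustration under stated assumptions, not the library's code: the function name preprocess_text_sketch, the new column name 'label_id', the alphabetical ordering of labels, and the JSON format of the saved mapping are all hypothetical.

```python
import json
from typing import Optional

import pandas as pd


def preprocess_text_sketch(
    df: pd.DataFrame,
    text_column: str = "text",
    emotion_column: str = "label",
    mapping_filename: Optional[str] = None,
) -> pd.DataFrame:
    """Lowercase the text and add an integer label column (illustrative)."""
    out = df.copy()
    # Step 1: lowercase all text.
    out[text_column] = out[text_column].str.lower()
    # Step 2: map string labels to integers; sorting keeps the mapping deterministic.
    labels = sorted(out[emotion_column].unique())
    mapping = {label: idx for idx, label in enumerate(labels)}
    # Assumption: the integers go into a new column named 'label_id'.
    out["label_id"] = out[emotion_column].map(mapping)
    # Optionally persist the mapping so predictions can be decoded later.
    if mapping_filename is not None:
        with open(mapping_filename, "w", encoding="utf-8") as fh:
            json.dump(mapping, fh, indent=2)
    return out
```

Saving the mapping matters because the integer ids are meaningless at inference time unless the same label-to-id assignment can be recovered.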

emotion_detective.data.training.data_preprocessing.spell_check_and_correct(df: DataFrame, text_column: str = 'text') → DataFrame

Perform spell checking and correction on the input column of a DataFrame.

Parameters:

df (pd.DataFrame) – Input DataFrame.
text_column (str) – Column in df containing text with potential spelling errors.

Returns:

DataFrame with spelling errors in the specified column corrected.

Return type:

pd.DataFrame

Author: Amy Suneeth
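
The documentation does not name the spell-checking library in use. To illustrate the general idea of column-wise correction, the sketch below uses only the standard library's difflib against a tiny hand-written vocabulary. The names VOCABULARY, correct_word and spell_check_sketch are hypothetical, and a real checker would rely on a full dictionary.

```python
import difflib

import pandas as pd

# Tiny illustrative vocabulary; an assumption made for this sketch only.
VOCABULARY = ["i", "am", "happy", "sad", "very", "today"]


def correct_word(word: str, vocabulary=VOCABULARY) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word


def spell_check_sketch(df: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Correct each whitespace-separated word in text_column (illustrative)."""
    out = df.copy()
    out[text_column] = out[text_column].apply(
        lambda sentence: " ".join(correct_word(w) for w in sentence.split())
    )
    return out
```

Words with no sufficiently close match are left unchanged, which avoids "correcting" names or domain terms into unrelated vocabulary entries.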

Troubleshooting

Problem: KeyError when the specified text or emotion column is not found in the DataFrame. Solution: Verify that the DataFrame contains the specified columns.
try:
    preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column')
except KeyError:
    print("The specified text or emotion column was not found in the DataFrame. Please check the column names.")
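Problem: ValueError when an invalid tokenizer name is provided. Solution: Ensure the tokenizer name is correct and supported. Note that the preprocess_text signature documented above has no tokenizer_name parameter, so this applies only to versions of the function that accept one; otherwise Python raises TypeError for the unexpected keyword argument.
try:
    preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
except TypeError:
    print("This version of preprocess_text does not accept tokenizer_name. Please remove the argument.")
except ValueError:
    print("Invalid tokenizer name. Please ensure the tokenizer name is correct and supported.")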
Problem: ValueError when class balancing fails due to insufficient data. Solution: Ensure there is enough data in each class for balancing.
try:
    balanced_df = balancing_multiple_classes(preprocessed_df, 'emotion_column')
except ValueError:
    print("Class balancing failed due to insufficient data. Please ensure there is enough data in each class.")
Problem: ImportError when spell checking dependencies are not installed. Solution: Install the necessary dependencies for spell checking.
try:
    corrected_df = spell_check_and_correct(df, 'text_column')
except ImportError:
    print("Spell checking dependencies are not installed. Please install the necessary libraries and try again.")
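
For the class-balancing failure described above, it can help to inspect the per-class counts before calling balancing_multiple_classes. The helper below is a hypothetical sketch: the name classes_below_minimum and the min_samples threshold are assumptions, since the documentation does not say how many samples per class the library requires.

```python
import pandas as pd


def classes_below_minimum(
    df: pd.DataFrame, emotion_column: str = "label", min_samples: int = 2
) -> dict:
    """Return {class: count} for every class with fewer than min_samples rows."""
    if emotion_column not in df.columns:
        raise KeyError(
            f"Column {emotion_column!r} not found; available: {list(df.columns)}"
        )
    counts = df[emotion_column].value_counts()
    return counts[counts < min_samples].to_dict()
```

An empty dictionary means every class meets the chosen threshold; otherwise the returned entries show which classes need more data.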