Data preparation for the Training Pipeline

This module contains functions for data ingestion and preprocessing during the training phase.

Data Ingestion

  • load_data: Load CSV or JSON file and return a DataFrame with specified text and emotion columns.

from emotion_detective.data.training.data_ingestion import load_data

# Example usage
df = load_data('data.csv', 'text_column', 'emotion_column')

emotion_detective.data.training.data_ingestion.load_data(file_path: str, text_column: str, emotion_column: str) → DataFrame

Load CSV or JSON file and return a DataFrame with specified text and emotion columns, renamed to ‘text’ and ‘label’ respectively.

Parameters:
  • file_path (str) – Path to the CSV or JSON file.

  • text_column (str) – Name of the column containing text data.

  • emotion_column (str) – Name of the column containing emotion data.

Returns:

DataFrame with the text and emotion columns renamed to ‘text’ and ‘label’.

Return type:

pd.DataFrame

Author: Martin Vladimirov
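
The behaviour described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the function name load_data_sketch is hypothetical, and it assumes that the format is chosen by file extension and that only the two requested columns are kept.

```python
from pathlib import Path

import pandas as pd


def load_data_sketch(file_path: str, text_column: str, emotion_column: str) -> pd.DataFrame:
    """Hypothetical stand-in for load_data (assumed behaviour, not the real code)."""
    # Assumption: the file format is inferred from the extension.
    suffix = Path(file_path).suffix.lower()
    if suffix == ".csv":
        df = pd.read_csv(file_path)  # raises FileNotFoundError if the path is wrong
    elif suffix == ".json":
        df = pd.read_json(file_path)
    else:
        raise ValueError(f"Unsupported file format: {suffix!r}")

    # Fail early with a clear message if either column is absent.
    missing = [c for c in (text_column, emotion_column) if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")

    # Keep the two requested columns and give them the standard names.
    return df[[text_column, emotion_column]].rename(
        columns={text_column: "text", emotion_column: "label"}
    )
```

The three exceptions raised here correspond to the cases listed under Troubleshooting below.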

Troubleshooting

  • Problem: FileNotFoundError when the input file is not found. Solution: Ensure the input file path is correct and the file exists.

    try:
        df = load_data('data.csv', 'text_column', 'emotion_column')
    except FileNotFoundError:
        print("The specified file was not found. Please check the file path.")
    
  • Problem: ValueError when the file format is not supported. Solution: Ensure the file is in CSV or JSON format.

    try:
        df = load_data('data.csv', 'text_column', 'emotion_column')
    except ValueError:
        print("Unsupported file format. Please provide a CSV or JSON file.")
    
  • Problem: KeyError when the specified columns are not found in the file. Solution: Verify that the file contains the specified text and emotion columns.

    try:
        df = load_data('data.csv', 'text_column', 'emotion_column')
    except KeyError:
        print("The specified columns were not found in the file. Please check the column names.")
    

Data Preprocessing

  • preprocess_text: Lowercase the text in a specified DataFrame column and map emotion labels to integers.

  • balancing_multiple_classes: Balance the classes in a DataFrame containing multiple classes.

  • spell_check_and_correct: Perform spell checking and correction on the input column of a DataFrame.

from emotion_detective.data.training.data_preprocessing import preprocess_text, balancing_multiple_classes, spell_check_and_correct

# Example usage
preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column')
balanced_df = balancing_multiple_classes(preprocessed_df, 'emotion_column')
corrected_df = spell_check_and_correct(df, 'text_column')

emotion_detective.data.training.data_preprocessing.balancing_multiple_classes(df: DataFrame, emotion_column: str = 'label') → DataFrame

Balance the classes in a DataFrame containing multiple classes.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the data.

  • emotion_column (str) – The name of the column containing class labels.

Returns:

A balanced DataFrame with an equal number of samples for each class.

Return type:

pd.DataFrame

Author: Amy Suneeth
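
The documentation does not say whether balancing is done by undersampling or oversampling. As one possible reading, the sketch below samples every class down to the size of the smallest one. The function name balance_by_downsampling and the seed parameter are hypothetical and not part of the library.

```python
import pandas as pd


def balance_by_downsampling(
    df: pd.DataFrame, emotion_column: str = "label", seed: int = 0
) -> pd.DataFrame:
    """Hypothetical sketch: downsample each class to the smallest class size."""
    # Size of the rarest class; every class is reduced to this many rows.
    smallest = df[emotion_column].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(emotion_column)
    ]
    return pd.concat(parts, ignore_index=True)
```

Downsampling discards rows from the larger classes, so with a very small minority class the balanced result can be tiny; oversampling would be the alternative design.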

emotion_detective.data.training.data_preprocessing.preprocess_text(df: DataFrame, text_column: str = 'text', emotion_column: str = 'label', mapping_filename: str | None = None) → DataFrame

Preprocess text data in a specified DataFrame column by:

  1. Lowercasing all text.

  2. Mapping emotion labels from strings to integers and storing them in a new column.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing text data and emotion labels.

  • text_column (str) – Name of the column in the DataFrame containing text data.

  • emotion_column (str) – Name of the column in the DataFrame containing emotion labels.

  • mapping_filename (str | None) – Optional filename to save the emotion mapping.

Returns:

DataFrame with lowercased text and integer emotion labels.

Return type:

pd.DataFrame

Author: Martin Vladimirov

emotion_detective.data.training.data_preprocessing.spell_check_and_correct(df: DataFrame, text_column: str = 'text') → DataFrame

Perform spell checking and correction on the input column of a DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • text_column (str) – Column in df containing text with potential spelling errors.

Returns:

DataFrame with spelling errors in the specified column corrected.

Return type:

pd.DataFrame

Author: Amy Suneeth
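
The documentation does not name the spell-checking backend. To show the general shape of column-wise correction, the toy sketch below swaps in the standard library's difflib and a tiny hand-written vocabulary. Both are stand-ins chosen for illustration, and the names VOCABULARY, correct_word and spell_check_sketch are hypothetical.

```python
import difflib

import pandas as pd

# Tiny illustrative vocabulary; a real spell checker ships a full dictionary.
VOCABULARY = ["i", "am", "feel", "very", "happy", "sad", "today"]


def correct_word(word: str) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in VOCABULARY:
        return word
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word


def spell_check_sketch(df: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Hypothetical sketch: correct each whitespace-separated word in a column."""
    out = df.copy()
    out[text_column] = out[text_column].apply(
        lambda s: " ".join(correct_word(w) for w in s.split())
    )
    return out
```

Words with no sufficiently close match are left unchanged, which avoids rewriting names or domain terms into unrelated dictionary words.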

Troubleshooting

  • Problem: KeyError when the specified text or emotion column is not found in the DataFrame. Solution: Verify that the DataFrame contains the specified columns.

    try:
        preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column')
    except KeyError:
        print("The specified text or emotion column was not found in the DataFrame. Please check the column names.")
    
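  • Problem: ValueError when an invalid tokenizer name is provided. Solution: Ensure the tokenizer name is correct and supported. This applies only to versions of preprocess_text that accept a tokenizer_name argument; the signature documented above does not list one.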

    try:
        preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
    except ValueError:
        print("Invalid tokenizer name. Please ensure the tokenizer name is correct and supported.")
    
  • Problem: ValueError when class balancing fails due to insufficient data. Solution: Ensure there is enough data in each class for balancing.

    try:
        balanced_df = balancing_multiple_classes(preprocessed_df, 'emotion_column')
    except ValueError:
        print("Class balancing failed due to insufficient data. Please ensure there is enough data in each class.")
    
  • Problem: ImportError when spell checking dependencies are not installed. Solution: Install the necessary dependencies for spell checking.

    try:
        corrected_df = spell_check_and_correct(df, 'text_column')
    except ImportError:
        print("Spell checking dependencies are not installed. Please install the necessary libraries and try again.")
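
For the class-balancing failure listed above, it can help to inspect the class counts before calling balancing_multiple_classes. The helper below is a small diagnostic sketch; the name classes_below_minimum and the min_samples threshold are hypothetical, since the documentation does not state how many samples per class are required.

```python
import pandas as pd


def classes_below_minimum(
    df: pd.DataFrame, emotion_column: str = "label", min_samples: int = 2
) -> dict:
    """Return {class: count} for classes with fewer than min_samples rows."""
    counts = df[emotion_column].value_counts()
    return counts[counts < min_samples].to_dict()
```

An empty result means every class meets the chosen threshold; otherwise the returned classes are the ones to collect more data for or to drop.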