Data preparation for the Training Pipeline
This module contains functions for data ingestion and preprocessing during the training phase.
Data Ingestion
load_data: Load CSV or JSON file and return a DataFrame with specified text and emotion columns.
from emotion_detective.data.training.data_ingestion import load_data
# Example usage
df = load_data('data.csv', 'text_column', 'emotion_column')
- emotion_detective.data.training.data_ingestion.load_data(file_path: str, text_column: str, emotion_column: str) → DataFrame
Load CSV or JSON file and return a DataFrame with specified text and emotion columns, renamed to ‘text’ and ‘label’ respectively.
- Parameters:
file_path (str) – Path to the CSV or JSON file.
text_column (str) – Name of the column containing text data.
emotion_column (str) – Name of the column containing emotion data.
- Returns:
DataFrame with text and emotion columns renamed.
- Return type:
pd.DataFrame
Author: Martin Vladimirov
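The behaviour described above can be sketched roughly as follows. This is not the package's implementation: the name load_data_sketch, the choice of reader by file extension, and the sample file are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_data_sketch(file_path: str, text_column: str, emotion_column: str) -> pd.DataFrame:
    """Hypothetical stand-in for load_data, written only from its description."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"No such file: {file_path}")

    # Assumption: the reader is chosen from the file extension.
    suffix = path.suffix.lower()
    if suffix == ".csv":
        df = pd.read_csv(path)
    elif suffix == ".json":
        df = pd.read_json(path)
    else:
        raise ValueError(f"Unsupported file format: {suffix!r}")

    # Selecting a column that does not exist raises KeyError.
    df = df[[text_column, emotion_column]]
    return df.rename(columns={text_column: "text", emotion_column: "label"})


# Small demonstration on a temporary CSV file (sample data is made up).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    csv_path.write_text("sentence,emotion\nI am happy,joy\nI am sad,sadness\n", encoding="utf-8")
    demo_df = load_data_sketch(str(csv_path), "sentence", "emotion")

print(list(demo_df.columns))
```

Whatever the original column names were, downstream code can then rely on the names 'text' and 'label', which match the defaults of the preprocessing functions.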
Troubleshooting
Problem: FileNotFoundError when the input file is not found. Solution: Ensure the input file path is correct and the file exists.
try:
    df = load_data('data.csv', 'text_column', 'emotion_column')
except FileNotFoundError:
    print("The specified file was not found. Please check the file path.")
Problem: ValueError when the file format is not supported. Solution: Ensure the file is in CSV or JSON format.
try:
    df = load_data('data.csv', 'text_column', 'emotion_column')
except ValueError:
    print("Unsupported file format. Please provide a CSV or JSON file.")
Problem: KeyError when the specified columns are not found in the file. Solution: Verify that the file contains the specified text and emotion columns.
try:
    df = load_data('data.csv', 'text_column', 'emotion_column')
except KeyError:
    print("The specified columns were not found in the file. Please check the column names.")
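When a KeyError appears, it can help to inspect the file's header before calling load_data. The sketch below is one possible way to do that for a CSV file. The helper missing_columns and the in-memory sample are hypothetical and are not part of emotion_detective.

```python
import io

import pandas as pd


def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# In-memory stand-in for a CSV file; a real file path works the same way.
sample_csv = io.StringIO("sentence,emotion\nI am happy,joy\n")

# nrows=0 parses only the header row, so large files are not loaded.
header = pd.read_csv(sample_csv, nrows=0)
missing = missing_columns(list(header.columns), ["sentence", "emotion_column"])
print(missing)
```

An empty list means both columns exist. Any names that are listed are the ones to correct in the load_data call.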
Data Preprocessing
preprocess_text: Preprocesses text data in a specified DataFrame column.
balancing_multiple_classes: Balance the classes in a DataFrame containing multiple classes.
spell_check_and_correct: Perform spell checking and correction on the input column of a DataFrame.
from emotion_detective.data.training.data_preprocessing import preprocess_text, balancing_multiple_classes, spell_check_and_correct
# Example usage
preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column')
balanced_df = balancing_multiple_classes(preprocessed_df, 'emotion_column')
corrected_df = spell_check_and_correct(df, 'text_column')
- emotion_detective.data.training.data_preprocessing.balancing_multiple_classes(df: DataFrame, emotion_column: str = 'label') → DataFrame
Balance the classes in a DataFrame containing multiple classes.
Parameters:
df (pd.DataFrame): The DataFrame containing the data.
emotion_column (str): The name of the column containing class labels.
Returns: pd.DataFrame: A balanced DataFrame with an equal number of samples for each class.
Author: Amy Suneeth
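The documentation does not say how the classes are balanced. As a minimal sketch, assuming random downsampling to the size of the smallest class, an equal number of samples per class could be obtained as follows. The function name, the seed parameter, and the toy data are hypothetical.

```python
import pandas as pd


def balance_by_downsampling(df: pd.DataFrame, emotion_column: str = "label", seed: int = 0) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class (assumed strategy)."""
    smallest = df[emotion_column].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(emotion_column)
    ]
    # Shuffle the combined rows so the classes are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


toy = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d", "e", "f"],
        "label": ["joy", "joy", "joy", "anger", "anger", "fear"],
    }
)
balanced = balance_by_downsampling(toy)
print(balanced["label"].value_counts().to_dict())
```

Downsampling discards data from the larger classes. Oversampling the smaller classes is an equally plausible reading of the description, so check the package source before relying on either behaviour.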
- emotion_detective.data.training.data_preprocessing.preprocess_text(df: DataFrame, text_column: str = 'text', emotion_column: str = 'label', mapping_filename: str | None = None) → DataFrame
Preprocess text data in a specified DataFrame column by:
1. Lowercasing all text.
2. Mapping emotion labels from strings to integers and storing them in a new column.
Parameters:
df (pd.DataFrame): Input DataFrame containing text data and emotion labels.
text_column (str): Name of the column in the DataFrame containing text data.
emotion_column (str): Name of the column in the DataFrame containing emotion labels.
mapping_filename (str | None): Optional filename under which to save the emotion mapping.
Returns: pd.DataFrame: DataFrame with lowercased text and integer emotion labels.
Author: Martin Vladimirov
- emotion_detective.data.training.data_preprocessing.spell_check_and_correct(df: DataFrame, text_column: str = 'text') → DataFrame
Perform spell checking and correction on the input column of a DataFrame.
Parameters:
df (pd.DataFrame): Input DataFrame.
text_column (str): Column in df containing text with potential spelling errors.
Returns: pd.DataFrame: DataFrame with spelling errors in the specified column corrected.
Author: Amy Suneeth
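The documentation does not name the spell-checking library, so the sketch below substitutes a toy approach. It matches each word against a small hand-written vocabulary with difflib from the Python standard library. The vocabulary, the cutoff value, and the function names are hypothetical. The example only shows the general shape of correcting a DataFrame column, not the package's method.

```python
import difflib

import pandas as pd

# Tiny made-up vocabulary; a real spell checker uses a full dictionary.
VOCABULARY = ["i", "am", "happy", "today", "feel", "very", "sad"]


def correct_word(word: str, vocabulary=VOCABULARY, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def spell_check_sketch(df: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Correct every whitespace-separated word in the given column."""
    out = df.copy()
    out[text_column] = out[text_column].apply(
        lambda text: " ".join(correct_word(word) for word in text.split())
    )
    return out


toy = pd.DataFrame({"text": ["i am hapy tody"]})
corrected = spell_check_sketch(toy)
print(corrected["text"].tolist())
```

Words with no sufficiently close match are left unchanged, which keeps names and rare terms from being rewritten into unrelated words.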
Troubleshooting
Problem: KeyError when the specified text or emotion column is not found in the DataFrame. Solution: Verify that the DataFrame contains the specified columns.
try:
    preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
except KeyError:
    print("The specified text or emotion column was not found in the DataFrame. Please check the column names.")
Problem: ValueError when an invalid tokenizer name is provided. Solution: Ensure the tokenizer name is correct and supported.
try:
    preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
except ValueError:
    print("Invalid tokenizer name. Please ensure the tokenizer name is correct and supported.")
Problem: ValueError when class balancing fails due to insufficient data. Solution: Ensure there is enough data in each class for balancing.
try:
    balanced_df = balancing_multiple_classes(final_df, 'emotion_column')
except ValueError:
    print("Class balancing failed due to insufficient data. Please ensure there is enough data in each class.")
Problem: ImportError when spell checking dependencies are not installed. Solution: Install the necessary dependencies for spell checking.
try:
    corrected_df = spell_check_and_correct(df, 'text_column')
except ImportError:
    print("Spell checking dependencies are not installed. Please install the necessary libraries and try again.")
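To check for insufficient data before balancing, the class counts can be inspected directly. The following is a minimal sketch. The helper name and the minimum threshold are hypothetical, since the documentation does not state how many samples per class the balancing step requires.

```python
import pandas as pd


def classes_below_minimum(df: pd.DataFrame, emotion_column: str = "label", minimum: int = 2) -> list:
    """List the classes with fewer than `minimum` rows (threshold is an assumption)."""
    counts = df[emotion_column].value_counts()
    return sorted(counts[counts < minimum].index)


toy = pd.DataFrame({"label": ["joy", "joy", "anger", "fear", "fear", "fear"]})
sparse_classes = classes_below_minimum(toy)
print(sparse_classes)
```

Classes reported here are candidates for collecting more data, or for merging or dropping, before balancing_multiple_classes is called.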