Data Preparation for the Training Pipeline
==========================================

This module contains functions for data ingestion and preprocessing during the training phase.

Data Ingestion
--------------

.. automodule:: emotion_detective.data.training.data_ingestion
   :members:
   :undoc-members:
   :show-inheritance:

- **load_data**: Load a CSV or JSON file and return a DataFrame with the specified text and emotion columns.

.. code-block:: python

   from emotion_detective.data.training.data_ingestion import load_data

   # Example usage
   df = load_data('data.csv', 'text_column', 'emotion_column')

Troubleshooting
~~~~~~~~~~~~~~~

- **Problem**: ``FileNotFoundError`` when the input file is not found.

  **Solution**: Ensure the input file path is correct and the file exists.

  .. code-block:: python

     try:
         df = load_data('data.csv', 'text_column', 'emotion_column')
     except FileNotFoundError:
         print("The specified file was not found. Please check the file path.")

- **Problem**: ``ValueError`` when the file format is not supported.

  **Solution**: Ensure the file is in CSV or JSON format.

  .. code-block:: python

     try:
         df = load_data('data.csv', 'text_column', 'emotion_column')
     except ValueError:
         print("Unsupported file format. Please provide a CSV or JSON file.")

- **Problem**: ``KeyError`` when the specified columns are not found in the file.

  **Solution**: Verify that the file contains the specified text and emotion columns.

  .. code-block:: python

     try:
         df = load_data('data.csv', 'text_column', 'emotion_column')
     except KeyError:
         print("The specified columns were not found in the file. Please check the column names.")

Data Preprocessing
------------------

.. automodule:: emotion_detective.data.training.data_preprocessing
   :members:
   :undoc-members:
   :show-inheritance:

- **preprocess_text**: Preprocess text data in a specified DataFrame column.
- **balancing_multiple_classes**: Balance the classes in a DataFrame containing multiple classes.
- **spell_check_and_correct**: Perform spell checking and correction on the input column of a DataFrame.

.. code-block:: python

   from emotion_detective.data.training.data_preprocessing import (
       preprocess_text,
       balancing_multiple_classes,
       spell_check_and_correct,
   )

   # Example usage
   preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
   balanced_df = balancing_multiple_classes(final_df, 'emotion_column')
   corrected_df = spell_check_and_correct(df, 'text_column')

Troubleshooting
~~~~~~~~~~~~~~~

- **Problem**: ``KeyError`` when the specified text or emotion column is not found in the DataFrame.

  **Solution**: Verify that the DataFrame contains the specified columns.

  .. code-block:: python

     try:
         preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
     except KeyError:
         print("The specified text or emotion column was not found in the DataFrame. Please check the column names.")

- **Problem**: ``ValueError`` when an invalid tokenizer name is provided.

  **Solution**: Ensure the tokenizer name is correct and supported.

  .. code-block:: python

     try:
         preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
     except ValueError:
         print("Invalid tokenizer name. Please ensure the tokenizer name is correct and supported.")

- **Problem**: ``ValueError`` when class balancing fails due to insufficient data.

  **Solution**: Ensure there is enough data in each class for balancing.

  .. code-block:: python

     try:
         balanced_df = balancing_multiple_classes(final_df, 'emotion_column')
     except ValueError:
         print("Class balancing failed due to insufficient data. Please ensure there is enough data in each class.")

- **Problem**: ``ImportError`` when spell checking dependencies are not installed.

  **Solution**: Install the necessary dependencies for spell checking.

  .. code-block:: python

     try:
         corrected_df = spell_check_and_correct(df, 'text_column')
     except ImportError:
         print("Spell checking dependencies are not installed. Please install the necessary libraries and try again.")
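
Several of the problems above (missing columns, too little data in a class) can be caught before any preprocessing function is called. The following is a minimal sketch of such pre-flight checks, not part of ``emotion_detective``: the helper names ``missing_columns`` and ``underrepresented_classes`` and the ``min_count`` threshold are assumptions made for illustration, and the package's own criteria for "insufficient data" may differ.

.. code-block:: python

   from collections import Counter


   def missing_columns(columns, required):
       """Return the required column names that are absent, in the order given."""
       present = set(columns)
       return [name for name in required if name not in present]


   def underrepresented_classes(labels, min_count):
       """Return {label: count} for every class with fewer than min_count rows."""
       counts = Counter(labels)
       return {label: n for label, n in counts.items() if n < min_count}


   # Hypothetical usage with plain Python data. With pandas, pass
   # df.columns and df['emotion_column'] instead.
   columns = ['text_column', 'emotion_column']
   labels = ['joy', 'joy', 'joy', 'anger', 'sadness', 'sadness']

   absent = missing_columns(columns, ['text_column', 'label_column'])
   sparse = underrepresented_classes(labels, min_count=2)

   print(absent)  # ['label_column']
   print(sparse)  # {'anger': 1}

Both helpers accept any iterable, so they work on lists as well as on a DataFrame's ``columns`` index or a label column. A non-empty result from either check suggests fixing the input before calling ``preprocess_text`` or ``balancing_multiple_classes``.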