Data Preparation for the Training Pipeline
==========================================

This section documents the functions used for data ingestion and preprocessing during the training phase.

Data Ingestion
--------------

.. automodule:: emotion_detective.data.training.data_ingestion
   :members:
   :undoc-members:
   :show-inheritance:

   - **load_data**: Loads a CSV or JSON file and returns a DataFrame containing the specified text and emotion columns.

   .. code-block:: python

      from emotion_detective.data.training.data_ingestion import load_data

      # Example usage
      df = load_data('data.csv', 'text_column', 'emotion_column')

Troubleshooting
~~~~~~~~~~~~~~~

   - **Problem**: ``FileNotFoundError`` is raised when the input file cannot be found.
     **Solution**: Ensure the input file path is correct and the file exists.

     .. code-block:: python

        try:
            df = load_data('data.csv', 'text_column', 'emotion_column')
        except FileNotFoundError:
            print("The specified file was not found. Please check the file path.")

   - **Problem**: ``ValueError`` is raised when the file format is not supported.
     **Solution**: Ensure the file is in CSV or JSON format.

     .. code-block:: python

        try:
            # Illustrative: a file whose extension is neither .csv nor .json
            df = load_data('data.xlsx', 'text_column', 'emotion_column')
        except ValueError:
            print("Unsupported file format. Please provide a CSV or JSON file.")

   - **Problem**: ``KeyError`` is raised when the specified columns are not found in the file.
     **Solution**: Verify that the file contains the specified text and emotion columns.

     .. code-block:: python

        try:
            df = load_data('data.csv', 'text_column', 'emotion_column')
        except KeyError:
            print("The specified columns were not found in the file. Please check the column names.")
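
The three failure modes above can also be checked before calling ``load_data``. The following is a minimal sketch using only the standard library. ``check_input_file`` is a hypothetical helper, not part of ``emotion_detective``, and it assumes that a JSON input is a list of records (one object per row), which may differ from the layout ``load_data`` actually expects.

.. code-block:: python

   import csv
   import json
   from pathlib import Path


   def check_input_file(path, text_column, emotion_column):
       """Hypothetical pre-flight check; not part of emotion_detective."""
       file_path = Path(path)
       if not file_path.is_file():
           raise FileNotFoundError(f"No such file: {file_path}")

       suffix = file_path.suffix.lower()
       if suffix == ".csv":
           # Read only the header row to learn the column names.
           with file_path.open(newline="", encoding="utf-8") as handle:
               columns = next(csv.reader(handle), [])
       elif suffix == ".json":
           # Assumption: the file holds a list of records (one dict per row).
           with file_path.open(encoding="utf-8") as handle:
               records = json.load(handle)
           has_rows = isinstance(records, list) and records
           columns = list(records[0]) if has_rows else []
       else:
           raise ValueError(f"Unsupported file format: {suffix or 'no extension'}")

       # Report every requested column that is absent from the file.
       missing = [name for name in (text_column, emotion_column) if name not in columns]
       if missing:
           raise KeyError(f"Missing columns: {missing}")
       return columns

Calling ``check_input_file('data.csv', 'text_column', 'emotion_column')`` before ``load_data`` raises the exception types listed above, with a message that names the missing file, extension, or columns.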


Data Preprocessing
------------------

.. automodule:: emotion_detective.data.training.data_preprocessing
   :members:
   :undoc-members:
   :show-inheritance:

   - **preprocess_text**: Preprocesses the text data in a specified DataFrame column.
   - **balancing_multiple_classes**: Balances the classes in a DataFrame that contains multiple classes.
   - **spell_check_and_correct**: Performs spell checking and correction on the input column of a DataFrame.

   .. code-block:: python

      from emotion_detective.data.training.data_preprocessing import (
          balancing_multiple_classes,
          preprocess_text,
          spell_check_and_correct,
      )

      # Example usage; df is a DataFrame such as the one returned by load_data
      preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
      balanced_df = balancing_multiple_classes(df, 'emotion_column')
      corrected_df = spell_check_and_correct(df, 'text_column')

Troubleshooting
~~~~~~~~~~~~~~~

   - **Problem**: ``KeyError`` is raised when the specified text or emotion column is not found in the DataFrame.
     **Solution**: Verify that the DataFrame contains the specified columns.

     .. code-block:: python

        try:
            preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
        except KeyError:
            print("The specified text or emotion column was not found in the DataFrame. Please check the column names.")

   - **Problem**: ``ValueError`` is raised when an invalid tokenizer name is provided.
     **Solution**: Ensure the tokenizer name is correct and supported.

     .. code-block:: python

        try:
            preprocessed_df = preprocess_text(df, 'text_column', 'emotion_column', tokenizer_name='roberta')
        except ValueError:
            print("Invalid tokenizer name. Please ensure the tokenizer name is correct and supported.")

   - **Problem**: ``ValueError`` is raised when class balancing fails due to insufficient data.
     **Solution**: Ensure there is enough data in each class for balancing.

     .. code-block:: python

        try:
            balanced_df = balancing_multiple_classes(df, 'emotion_column')
        except ValueError:
            print("Class balancing failed due to insufficient data. Please ensure there is enough data in each class.")

   - **Problem**: ``ImportError`` is raised when the spell checking dependencies are not installed.
     **Solution**: Install the necessary dependencies for spell checking.

     .. code-block:: python

        try:
            corrected_df = spell_check_and_correct(df, 'text_column')
        except ImportError:
            print("Spell checking dependencies are not installed. Please install the necessary libraries and try again.")
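
Before balancing, it can help to see how many samples each class has. The following is a minimal sketch using only the standard library. ``find_sparse_classes`` and its ``min_count`` threshold are hypothetical and not part of ``emotion_detective``; the minimum that ``balancing_multiple_classes`` actually requires, if any, is not documented here.

.. code-block:: python

   from collections import Counter


   def find_sparse_classes(labels, min_count=2):
       """Return {label: count} for classes with fewer than min_count samples.

       Hypothetical helper; min_count is an illustrative threshold.
       """
       counts = Counter(labels)
       return {label: n for label, n in counts.items() if n < min_count}


   labels = ["joy", "joy", "anger", "sadness", "sadness", "sadness"]
   sparse = find_sparse_classes(labels, min_count=2)
   print(sparse)  # {'anger': 1}

With a DataFrame, the labels can be passed as ``df['emotion_column'].tolist()``. An empty result means every class meets the chosen threshold.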