Saverio Mazza

Master ETL-Texts: A Complete Text Processing Guide

ETL-Texts aims to be a single pipeline for extracting, translating, cleaning, and transforming text files into embeddings, so that they are ready for training or inference with natural language processing models. Each step in the process takes an input path and an output path, so a step can run on its own or as part of a sequential flow through the pipeline.
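
The input-path/output-path principle can be sketched as follows. This is only an illustration, not code from the repository: the `Step` signature, the two stand-in steps, and `run_pipeline` are hypothetical names chosen for the example. Each step reads from one directory and writes to another, so the output of one step can become the input of the next.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# Hypothetical step signature: read files from input_dir, write results to output_dir.
Step = Callable[[Path, Path], None]


def strip_step(input_dir: Path, output_dir: Path) -> None:
    """Stand-in for a real step: trims surrounding whitespace."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(input_dir.glob("*.txt")):
        text = src.read_text(encoding="utf-8").strip()
        (output_dir / src.name).write_text(text, encoding="utf-8")


def uppercase_step(input_dir: Path, output_dir: Path) -> None:
    """Stand-in for a real step: upper-cases the text."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(input_dir.glob("*.txt")):
        text = src.read_text(encoding="utf-8").upper()
        (output_dir / src.name).write_text(text, encoding="utf-8")


def run_pipeline(steps: List[Tuple[str, Step]], source_dir: Path, work_dir: Path) -> Path:
    """Run steps in order; each step's output directory is the next step's input."""
    current = source_dir
    for index, (name, step) in enumerate(steps):
        target = work_dir / f"{index:02d}_{name}"
        step(current, target)
        current = target
    return current  # directory holding the final output
```

Because every step only depends on its two paths, any of them can also be called directly, for example `uppercase_step(Path("data/raw"), Path("data/upper"))`, without running the rest of the chain.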

https://github.com/mazzasaverio/etl-texts

Features

  • Text Extraction: Utilizes the unstructured package to extract text. (See src/services/text_extractor.py)
  • Text Translation: Employs a multilingual model for optional text translation. (See src/services/text_translator.py)
  • Text Cleaning: Aggregates and cleans text according to specific requirements. (Currently under development, see src/services/text_cleaner.py)
  • Text Embedding: Computes embeddings according to user-defined requirements and saves them either to storage or to a vector database. (Currently under development, see src/services/text_embeddings.py)

A file is processed only if it has not been processed before, which is determined by checking whether its output already exists at the destination path. Every step writes its output as JSON, which keeps later manipulation and ingestion stages flexible.
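
A minimal sketch of the "process only once" rule, assuming one JSON output file per input file. The function name `process_if_new`, the file-naming scheme, and the JSON fields (`source`, `text`) are assumptions made for this example and are not taken from the repository.

```python
import json
from pathlib import Path


def process_if_new(src: Path, output_dir: Path) -> bool:
    """Process src unless its JSON output already exists.

    Returns True if the file was processed, False if it was skipped.
    """
    # Hypothetical naming scheme: <input stem>.json inside the destination path.
    dest = output_dir / f"{src.stem}.json"
    if dest.exists():
        return False  # already processed on a previous run

    output_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical record layout; a real step would store its own result here.
    record = {"source": src.name, "text": src.read_text(encoding="utf-8")}
    dest.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return True
```

Calling the function twice on the same file does the work once and skips the second time. Note that with this naming scheme two inputs sharing a stem (say `report.txt` and `report.pdf`) would map to the same output, so a real implementation might need a more specific key.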

Configuration

Input and output paths are set through the .env file (refer to .env.example, which should be renamed to .env).
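
For illustration, the sketch below reads path settings from a `.env`-style file using only the standard library. The variable names `INPUT_PATH` and `OUTPUT_PATH` are hypothetical (check `.env.example` for the real keys), and the project itself may load the file differently, for instance through a dedicated dotenv package.

```python
from pathlib import Path
from typing import Dict, Tuple


def load_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_paths(env: Dict[str, str]) -> Tuple[Path, Path]:
    """Return (input_path, output_path); the key names are assumptions."""
    return Path(env["INPUT_PATH"]), Path(env["OUTPUT_PATH"])
```

With a `.env` containing `INPUT_PATH=data/raw` and `OUTPUT_PATH=data/processed`, `resolve_paths(load_env_file(Path(".env")))` yields the two directories as `Path` objects.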

Roadmap

  • Parallelization and containerization of the process for enhanced performance.
  • Automation through an event-driven pipeline that can use serverless computing, depending on scale and processing-time requirements.
  • Support for multiple input and output backends, so that data lakes, SQL databases, and NoSQL databases can be chosen or swapped.