The article provides a step-by-step guide on using Sphinx with Vertex AI pipelines to auto-generate comprehensive documentation for machine learning projects.
Abstract
The author of the article emphasizes the importance of maintaining up-to-date documentation for machine learning projects, particularly for production and proof of concept phases. The article offers a detailed tutorial on employing Sphinx, a tool favored by the Python community, to create structured documentation that includes advanced features such as logos, notes, images, and markdown documents. The process involves setting up the environment, installing Sphinx, configuring it to recognize Python modules, and integrating markdown files. The author illustrates the use of Sphinx with an end-to-end open-source example of a Vertex AI pipeline, demonstrating how to automate the documentation process and make it visually appealing with custom themes and logos. The article concludes with the benefits of using Sphinx, including its ability to save time and
Document Your Machine Learning Project in a Smart Way
Step-by-step tutorial on how to use Sphinx with Vertex AI pipelines.
The purpose of this article is to share the procedure for using Sphinx to auto-generate the documentation of your machine learning project.
I'm going to use advanced features of Sphinx such as logos, notes, images, and Markdown documents. I am also going to show the Python package you need so that Sphinx can extract the docstrings present in your Vertex pipelines.
Some context
Let’s get started! As you may know, having up-to-date documentation for your machine learning projects is vital for both the production and proof of concept phases. Why is it vital? Because it helps you clarify and simplify your modules, collaborate with your team, quickly onboard a new team member, evolve the project faster, and share it with the business owners.
Personally, I have seen many cases in which the documentation was ignored because of time-to-market constraints, and this turned out to be fatal once the project was released to production. I therefore advise you to avoid any manual procedure for generating your documentation, as such procedures always end up out of sync and time-consuming.
So, before publishing your project, take some time to check its readability. In my case, I tend to use the following files:
README: an easy-to-read file that introduces the project and gives general information such as its purpose, technical details, and software components
LICENCE: a file that states the license
CONTRIBUTING: a file that describes the step-by-step procedure for contributors to follow
USAGE: a file that explains how to use the project
CHANGELOG: a file that tracks the changes and the released versions of the project
Please note that the most important file is the README. The contribution and usage information can be added directly to the README file. The changelog file can be added later, before releasing the project to production. To write these files you can use Markdown, plain text, or reStructuredText.
See below the overview of the process we are going to describe.
Overview of the process (Image by the author)
What is Sphinx?
Sphinx is a powerful, easy-to-use open source documentation generator widely used by the Python community. It produces well-structured documentation. There are a few alternatives, such as MkDocs, Doxygen, and pdoc, but Sphinx remains a complete and easy-to-use option.
The main features:
support for several output formats: HTML, PDF, plain text, EPUB, TeX, etc.
automatic generation of the documentation
automatic link generation
multi-language support
various extensions available
Steps:
I. Set up the environment
II. Install a virtual environment
III. Install Sphinx
IV. Set up Sphinx
V. Build the documentation
I. Set up the environment
Python 3
Local virtual machine or Vertex AI Workbench (Jupyter notebook running in a virtual environment with Python 3)
Python project that contains Vertex AI code
Virtualenv
kfx: extension for the Kubeflow Pipelines SDK
MyST parser: parser for MyST, a flavor of Markdown
Vertex project containing SDK pipelines
Let’s use an end-to-end open source example of a Vertex AI pipeline under the Apache-2.0 license. It is a good example because it uses Vertex pipelines and does not use a documentation generator.
First, clone the source code and go to the vertex-pipelines-end-to-end-samples directory:
git clone https://github.com/GoogleCloudPlatform/vertex-pipelines-end-to-end-samples.git
cd vertex-pipelines-end-to-end-samples
Install Sphinx and its extensions in one go from the requirements-sphinx.txt file:
pip install -r requirements-sphinx.txt
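After the installation, it can be useful to confirm that the packages are importable from the active virtual environment. The snippet below is a minimal sketch of such a check; the import names sphinx, myst_parser, and kfx are assumptions based on the packages listed in the environment section, not taken from requirements-sphinx.txt.

```python
from importlib.util import find_spec

# Assumed import names for Sphinx, the MyST parser, and kfx.
ASSUMED_MODULES = ["sphinx", "myst_parser", "kfx"]


def missing_modules(module_names):
    """Return the top-level module names the import system cannot find."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("Sphinx and its extensions are importable.")
```

If a name is reported as missing, check that the virtual environment is activated before rerunning the pip command.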
Create a docs directory (if it doesn’t exist) to store the Sphinx layout:
mkdir docs
cd docs
Generate the initial directory structure with the sphinx-quickstart command:
sphinx-quickstart
Choose separate source and build directories, then enter the project name, author name, project release, and project language. You can find my configuration below:
You should obtain the following tree structure:
As you can see, we chose to separate the build and the source directories. Let’s give a few explanations about its content.
The build/ directory is meant to hold the generated documentation. It is empty for now, as we have not generated any documentation yet.
The make.bat (Windows) and Makefile (Unix) files are scripts that simplify the generation of the documentation.
The source/conf.py file is the configuration file of the Sphinx project. It contains the default configuration keys and the configuration you specified in sphinx-quickstart.
The source/index.rst file is the root document of the project. It contains the table of contents tree (toctree) directive, where you should list all the modules you want to include in your documentation.
The _static directory contains custom stylesheets and other static files.
The _templates directory stores the Sphinx templates.
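To make the role of source/conf.py more concrete, here is a minimal sketch of what such a file could look like for this layout. It is an illustration under assumptions, not the author's actual configuration: the project and author values are placeholders, the logo path is hypothetical, and the list of extensions is one plausible choice for extracting docstrings and reading Markdown.

```python
# docs/source/conf.py (illustrative sketch, not the author's file)
import os
import sys

# Make the project root importable so autodoc can find the pipelines package.
# conf.py lives in docs/source/, so the root is two levels up. The relative
# path assumes the working directory is docs/source when this file is evaluated.
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

project = "vertex-pipelines-end-to-end-samples"  # placeholder
author = "Your Name"  # placeholder

extensions = [
    "sphinx.ext.autodoc",   # pull documentation from docstrings
    "sphinx.ext.napoleon",  # understand Google and NumPy style docstrings
    "myst_parser",          # parse Markdown sources
]

# Map file extensions to the parser that handles them.
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

templates_path = ["_templates"]
html_static_path = ["_static"]
html_theme = "alabaster"        # Sphinx's default theme; swap in your own
html_logo = "_static/logo.png"  # hypothetical logo file
```

The sys.path line matters most: without it, autodoc cannot import the modules whose docstrings it is supposed to read.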
IV. Set up Sphinx
Identify the Python modules: /pipelines
The /pipelines directory contains the Python code we want to include in the Sphinx documentation. Note that Sphinx sees the submodules of the pipelines package only if you add an __init__.py file to the /pipelines directory.
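A missing __init__.py is easy to overlook in nested folders. The following sketch, assuming the package lives in a pipelines directory at the project root, lists every directory that contains Python files but no __init__.py. The helper name dirs_missing_init is made up for this illustration.

```python
from pathlib import Path


def dirs_missing_init(package_root):
    """List directories under package_root that hold .py files but no __init__.py."""
    root = Path(package_root)
    if not root.is_dir():
        return []
    missing = []
    # Check the root itself, then every nested directory in a stable order.
    for directory in [root, *sorted(p for p in root.rglob("*") if p.is_dir())]:
        has_python = any(
            f.suffix == ".py" for f in directory.iterdir() if f.is_file()
        )
        if has_python and not (directory / "__init__.py").exists():
            missing.append(directory)
    return missing


if __name__ == "__main__":
    for directory in dirs_missing_init("pipelines"):
        print(f"Missing __init__.py in {directory}")
```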
Generate the Sphinx sources
Use sphinx-apidoc to generate the sources for your API documentation (make sure you are at the root of the project). The generated Sphinx sources are stored in docs/source/pipelines.
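As a sketch of this step, the script below assembles a sphinx-apidoc call and runs it from Python. The -o (output directory) and -f (overwrite existing files) options are standard sphinx-apidoc options, but this exact combination is an assumption rather than the author's command, and the helper name apidoc_command is hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def apidoc_command(package_dir="pipelines", output_dir="docs/source/pipelines"):
    """Build the sphinx-apidoc argument list for a package and an output directory."""
    # -f overwrites previously generated files, -o sets the output directory.
    return ["sphinx-apidoc", "-f", "-o", output_dir, package_dir]


if __name__ == "__main__":
    command = apidoc_command()
    if shutil.which(command[0]) is None or not Path("pipelines").is_dir():
        print("Run this from the project root with Sphinx installed.")
    else:
        # The relative paths resolve against the project root.
        subprocess.run(command, check=True)
```

Running the equivalent command directly in a terminal works just as well; wrapping it in a script is only convenient when the step is part of a larger automation.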
You can check that the following files were created at docs/source/pipelines:
Copy the Markdown files to docs/source
Copy the README.md, CONTRIBUTING.md, and USAGE.md files automatically into the Sphinx source directory (docs/source/). Add the following lines to docs/Makefile to automate the synchronization of the Markdown files:
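The same synchronization can also be sketched in Python, as an alternative to Makefile rules. This is not the author's Makefile snippet: the function name sync_markdown is hypothetical, and the script assumes it is run from the project root with the three Markdown files located there.

```python
import shutil
from pathlib import Path

MARKDOWN_FILES = ("README.md", "CONTRIBUTING.md", "USAGE.md")


def sync_markdown(project_root, source_dir, names=MARKDOWN_FILES):
    """Copy the named Markdown files from project_root into source_dir.

    Returns the names that were copied; files that do not exist are skipped.
    """
    root, target = Path(project_root), Path(source_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        if (root / name).is_file():
            shutil.copy2(root / name, target / name)
            copied.append(name)
    return copied


if __name__ == "__main__":
    print("Copied:", sync_markdown(".", "docs/source"))
```

Calling this before each build keeps the copies in docs/source aligned with the originals, so the Markdown files only ever need to be edited in one place.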
We reached the end of our journey with Sphinx. I hope that you found the content useful!
Summary
We have seen how to use Sphinx, a powerful tool to generate documentation for your machine-learning project. We have customized the documentation with logos, images, and markdown content. Of course, Sphinx comes with plenty of other extensions you can use to render your documentation even more appealing.
Thank you for reading!