After robots.txt, security.txt, and humans.txt,
a new standard has been proposed to the web ecosystem and may soon become essential: llms.txt.
llms.txt was conceived by Jeremy Howard, co-founder of Answer.AI, to address a fundamental challenge in AI-human interaction.
When AI assistants attempt to process standard web pages, they struggle with non-essential elements like navigation menus, scripts, and styling.
These elements consume valuable context space without contributing to the actual content understanding.
llms.txt provides an elegant solution: it delivers precisely curated information in a format that AI systems can efficiently process and understand.
If you need to convert files from one markup format to another, Pandoc is your Swiss Army knife.
Developed by John MacFarlane, Pandoc is a Haskell library for converting from one markup format to another.
The pandoc repo also provides a command-line tool built on this library.
Easy to install and ready to convert.
In this how-to guide, we will see how to install the pandoc command-line tool on your Upsun project.
Assumptions:
- You already have an Upsun account. If you don’t, please register for a trial account. You can sign up with an email address or an existing GitHub, Bitbucket, or Google account. If you choose one of these accounts, you can set a password for your Upsun account later.
- You have the Upsun CLI installed locally.
- You have the Git CLI installed locally.
With these in place, you can install pandoc on your project and quickly generate an llms.txt file from your HTML pages.
Prepare your local HTML project
In order to quickly showcase the strength of Pandoc, we will simulate a simple HTML application that could be obtained using a static website generator like Hugo. The proposed structure will be:
🚨 Please note: This html-app-example.tar.gz file contains all HTML files (index.html, ./learn/*.html) in this llms folder.
Give Pandoc a try
To showcase the power of Pandoc, let’s give it a try locally and convert our HTML to an llms.txt file.
Install Pandoc locally
To install Pandoc locally, please follow the official Installation Guide.
Use Pandoc for HTML to Markdown conversion
You should now have access to the pandoc tool. We will use it to generate a public/llms-test.txt file that concatenates all the HTML pages of the project as Markdown.
Let’s execute this command line, which looks for all HTML files in the public folder and concatenates them into a single file, ./public/llms-test.txt:
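A minimal sketch of such a conversion, assuming pandoc is on your PATH and the pages live under ./public. The build_llms_txt function name and the PANDOC override variable are hypothetical, introduced only for illustration:

```shell
# Sketch: convert every HTML page under a folder to Markdown and
# concatenate the results into a single output file.
# PANDOC is a hypothetical override for the pandoc binary to call.
build_llms_txt() {
  src_dir="$1"
  out_file="$2"
  pandoc_bin="${PANDOC:-pandoc}"

  # Start from an empty output file.
  : > "$out_file"

  # Sort the file list so the page order is stable between runs.
  find "$src_dir" -type f -name '*.html' | sort | while IFS= read -r page; do
    "$pandoc_bin" --from html --to markdown "$page" >> "$out_file"
    # Blank line between pages so their Markdown does not run together.
    printf '\n' >> "$out_file"
  done
}

# Usage (calls pandoc, so it is left commented out here):
# build_llms_txt ./public ./public/llms-test.txt
```

Because the output file ends in .txt, it is never picked up by the `*.html` search, even though it is written into the same folder.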
Use Pandoc in your Upsun project
Generating this llms.txt file locally and committing it with your source code is not convenient.
We would like the file to be regenerated each time you update your website content.
Init your Upsun config
The Upsun CLI provides a command to initialize a basic config for your local code. As it is a simple HTML app, we will generate a minimal configuration file using the following command:
When prompted, choose:
- Javascript/Node.js
- application name: app
- no service selected
Then update the .upsun/config.yaml file for the router to point to your public folder:
Create an Upsun project
You then need to create an Upsun project by executing these commands and following the prompts:
Install Pandoc
There are two ways to install pandoc on your project:
Using a shell script
John MacFarlane provides in his Pandoc repo a quick and easy way to install Pandoc. We’ve prepared a shell script for you (source) that can be used to install the latest version of Pandoc. Update your .upsun/config.yaml file and add this curl call in your applications.app.hooks.build step:
The install-pandoc.sh script installs the pandoc binary from the Pandoc repo in the /app/.global/bin folder of your application container.
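To make the idea concrete, here is a rough sketch of what such an install script might do. It is not the actual install-pandoc.sh: it pins an example version rather than resolving the latest one, and the function names, the release asset naming, and the archive layout are assumptions to verify against the Pandoc releases page:

```shell
# Build the download URL for a given Pandoc release.
# Assumption: Linux assets are named pandoc-<version>-linux-amd64.tar.gz.
pandoc_release_url() {
  version="$1"
  printf 'https://github.com/jgm/pandoc/releases/download/%s/pandoc-%s-linux-amd64.tar.gz\n' \
    "$version" "$version"
}

# Download a release archive and copy the pandoc binary into a folder.
# The default destination matches the folder named in this guide.
install_pandoc() {
  version="$1"
  dest="${2:-/app/.global/bin}"
  workdir=$(mktemp -d)

  mkdir -p "$dest" || return 1
  curl -fsSL -o "$workdir/pandoc.tar.gz" "$(pandoc_release_url "$version")" || return 1
  tar -xzf "$workdir/pandoc.tar.gz" -C "$workdir" || return 1

  # Assumption: the archive unpacks to pandoc-<version>/bin/pandoc.
  cp "$workdir/pandoc-$version/bin/pandoc" "$dest/pandoc" || return 1
  rm -rf "$workdir"
}

# Usage, for example from a build hook (the version is only an example):
# install_pandoc 3.1.11
```

Pinning a version keeps builds reproducible, at the cost of having to bump the number by hand when a new release ships.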
Using Composable image
The Upsun Composable image provides enhanced flexibility when defining your app. It allows you to install several runtimes and tools in your application container, in a “one image to rule them all” approach. The composable image is built on Nix, and the good news is that a Pandoc package is available. Update your .upsun/config.yaml by commenting out the default type parameter and adding the following lines:
Use Pandoc dynamically
You can now use pandoc in your project to dynamically generate a public/llms.txt file that concatenates all the HTML pages as Markdown, as tested locally before.
Update your .upsun/config.yaml by adding the following lines:
Once deployed, append /llms.txt to your environment URL to see the generated file.
Conclusion
Et voilà, we saw how to use pandoc to convert all existing HTML pages into a single Markdown public/llms.txt file. Now, perhaps the next step would be to train an AI Assistant with the file llms.txt…
Stay tuned.
Discover how to deploy a personal Chainlit AI assistant on Upsun by reading this great blogpost: Experiment with Chainlit AI interface with RAG on Upsun