My Experiments with Open Source AI/ML to Improve Team Productivity (Part 1)

Recently, I kept seeing that OpenAI's Sora is making waves across the web.

Sora is capable of making videos out of textual descriptions.

I went to their website to give it a go.

No way to try it immediately, though.

I'd have loved to make some videos for my social media accounts or explainer videos for our apps.

So, from that place of disappointment, I started thinking.

These days, I want to find newer and better ways to bring AI/ML into my team's workflows.

That's the singular thought in my mind.

What can I do - for myself and my team - to enhance our capability and productivity with AI?

What can I do NOW?

I'm determined to pick up some good AI/ML skills, while also serving my team.

Then it hit me: let's start with something a bit more modest.

We have many, many informative wiki pages accumulated over the years.

Why not build a nice and simple chatbot on top of them?

So, that's what I set out to do.

This post outlines the first part of my journey in building up a documentation chatbot.

All forward-looking teams must set up a Documentation Chatbot

While new technologies come and go, organisations always want the same things:

  1. Better cooperation

  2. Higher productivity

  3. Greater performance

  4. Increased profits

  5. Lower costs

To get to “Increased profits”, it is important to start with “Better cooperation”.

I’ve found, through personal experience and the research I’ve come across, that teams work better when there is a high degree of information sharing.

More writing. More reading. More Q&A. More discussions.

All the above help with raising team capabilities.

Your team probably already hosts a bunch of documents or wikis for information sharing.

If not, I highly recommend adopting a document-sharing system and encouraging your team members to write often.

I can’t count how many times the wiki has saved our team from total disaster.

Faster access to relevant information means quicker debugging, less friction, and so higher productivity.

And a Documentation Chatbot is perfect for taking your document system to the next level.

People can chat with the AI, ask specific questions, and get polished yet relevant answers.

So, I’ll say it plainly: all forward-looking teams should set up a Documentation Chatbot if they care about increasing performance.

Teams without such systems will lose ground to teams that have them.

A simple framework to build custom Document Chatbots

So, the question is: “How does one build a custom document chatbot?”

The full answer is quite involved, and you can find detailed explanations elsewhere.

Here, I will give you some pointers in 4-5 sentences.

These days, the most common method for building document chatbots is RAG (Retrieval Augmented Generation).

Sounds sophisticated, but RAG is quite simple.

It works in two phases:

  1. Given a question or instruction, find the most relevant documents

  2. Pass the relevant docs to an LLM such as ChatGPT and let it answer

The above two steps can be broken down further into the following steps (sketched in code right after this list):

  1. Store all documents in a vector database

  2. Given a chat prompt, find the most similar documents

  3. Send these documents to an LLM, such as ChatGPT

  4. Get a synthesised answer
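
To make the flow concrete, here's a minimal sketch of that pipeline in Python. Note that find_similar_docs and ask_llm are hypothetical placeholders: the first stands in for the vector DB query (covered below), the second for the LLM call (covered in Part 2).

def answer_question(question):
    # Steps 1-2: retrieve the most relevant documents from the vector DB
    # (find_similar_docs is a hypothetical helper)
    docs = find_similar_docs(question, n_results=5)

    # Steps 3-4: pass the question plus the retrieved docs to an LLM
    # (ask_llm is a hypothetical helper wrapping e.g. OpenAI's API)
    context = "\n\n".join(docs)
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)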

In Part 1 of this article series, I will share what I learned setting up a vector database, indexing our wiki in it, and then finding relevant documents for a given query.

An easy-to-use open source tool to store data for Document Chatbots

The first step is to select and configure a vector DB.

Historically, vector DBs have been comparatively complicated to use and manage, particularly in production.

They were difficult for beginners to get started with.

But today I will introduce Chroma DB, which is beginner-friendly.

It drastically simplifies storing documents in a vector DB and retrieving similar docs from it.

Please note that when I say documents, I don’t necessarily mean text content.

For instance, Chroma DB supports multi-modal data storage. As of now, it can store both text and image formats. Hopefully more data formats get supported in the future.

Some Important Notes to make your life easier when using ChromaDB

Before we get started, I am going to explain certain things about ChromaDB to make your life easier.

Getting some of these settings wrong could waste a lot of your time on relatively unimportant things.

First, by default, ChromaDB runs in ephemeral mode. That means you lose all data when the process exits.

I do not like this default, so I prefer to run ChromaDB in persistent mode, where data survives process exits.
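
To make the difference concrete, here are the two modes side by side. These are standard ChromaDB client constructors; the path is just an example (mine appears later in this post).

import chromadb

# Ephemeral (the default): data lives in memory and vanishes on exit
ephemeral_client = chromadb.Client()

# Persistent: data is written to disk under the given directory
persistent_client = chromadb.PersistentClient(path="./my_wiki_db")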

In persistent mode, you can open the ChromaDB contents in SQLite Browser and inspect the data. So get it installed if you'd like GUI access to the database.
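
If you'd rather skip the GUI, here's a minimal sketch using Python's built-in sqlite3 module. I'm assuming the usual ChromaDB layout, where the persistence directory contains a chroma.sqlite3 file:

import sqlite3

# Assumption: recent ChromaDB versions keep a chroma.sqlite3 file
# inside the persistence directory you passed to PersistentClient
conn = sqlite3.connect("./rag/hexwiki.db/chroma.sqlite3")
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
conn.close()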

ChromaDB is also capable of running in client/server mode, the way traditional DBs such as PostgreSQL and MySQL do. But this is not necessary with ChromaDB, since it can run as an embedded, file-based database, much like SQLite. So I will avoid client/server mode in this post as well.
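
For completeness, here's roughly what client/server mode looks like, assuming a Chroma server is already running on localhost (I won't use this anywhere in this post):

import chromadb

# Connects to a Chroma server assumed to be listening on localhost:8000
http_client = chromadb.HttpClient(host="localhost", port=8000)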

How I Indexed Hexmos Wiki with ChromaDB in < 20 Lines of Code

The first task is to index our wiki contents within ChromaDB.

We have ~200 pages in our wiki alone right now.

On top of that, we have many Google Docs, personal pages, READMEs, and web documentation, building up a solid knowledge base.

I started by cloning our wiki into a directory: hexwiki.

The wiki's structure is a hierarchy of directories, each containing a number of .md (Markdown) files.

I could have downloaded all our Google Docs and other material into the directory as well.

For demonstration purposes, I'll skip that complexity in this post.

Initialising a persistent ChromaDB database file

import glob
import chromadb

settings = chromadb.Settings(allow_reset=True)
client = chromadb.PersistentClient(path="./rag/hexwiki.db", settings=settings)
client.reset()

  • We first import chromadb and our other dependencies

  • Create a Settings object. We want our database to be recreated on every re-run (during development), so we set allow_reset=True

  • Next, we create a PersistentClient.

  • Once we run the above code, a database appears at ./rag/hexwiki.db

  • Let's open the database in SQLite Browser and see what's inside

  • As expected, we see a bunch of tables. This means the database has been initialised successfully.

Indexing Markdown Files within ChromaDB

md_files = glob.glob('./hexwiki/**/*.md', recursive=True)
collection = client.create_collection(name="wikicorpus")

The first line here gets a list of all the Markdown file paths within the hexwiki directory, recursively.

The second line creates a collection named wikicorpus in ChromaDB.

A collection is a group of documents in ChromaDB.

We can make queries on top of a collection to find relevant documents.
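
Once documents have been added (we'll do that next), a couple of standard collection methods make quick sanity checks easy:

print(collection.count())        # how many documents are stored
print(collection.peek(limit=2))  # a small sample of stored records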

Next, I will list the overview function for indexing all the markdown files:

def index_wiki():
    # Walk over every markdown file and add it to the collection
    for count, md_file in enumerate(md_files, start=1):
        print()
        print("###")
        print(f">> Processing {count} - {md_file}")
        contents = file_contents(md_file)
        fingerprint = get_fingerprint(contents, count, md_file)
        add_to_collection(contents, md_file, fingerprint)

There are 4 important steps in the above function.

  1. We loop over all the markdown filenames

  2. For each file, we read the contents as a string

  3. Based on the filename, contents, and a count attribute, we generate a hex fingerprint

  4. Finally, we use the ChromaDB API to insert the document into the collection

Let’s quickly look at how each part is implemented.

Reading file contents is basic Python, and quite straightforward:

def file_contents(md_file):
    # Read the whole markdown file into a single string
    with open(md_file) as f:
        return f.read()

The fingerprint generator creates a unique identifier for each file content. To ensure uniqueness for each file, we use contents, iterator count, and filename.

def get_fingerprint(file_contents, count, md_file):
    fingerprint = str(count)
    if len(file_contents) > 10:
        # Long enough content: mix in the first 10 characters
        fingerprint += file_contents[:10]
    else:
        # Very short file: fall back to the filename instead
        fingerprint += md_file
    return hex(hash(fingerprint))

The fingerprint generator first builds up a string.

It starts with the count element.

Then, for files with more than 10 characters of content, we append the first 10 characters of the contents. For smaller files, we append the filename instead.

Finally, we return a hexadecimal string representation of the hash of the fingerprint.
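
One caveat worth knowing: Python's built-in hash() is randomised per process, so these fingerprints change between runs. That's harmless here, since we reset the database on every run, but if you need IDs that stay stable across runs, here's a sketch of a deterministic alternative using only the standard library:

import hashlib

def get_stable_fingerprint(file_contents, md_file):
    # sha256 over filename + contents produces the same ID on every run
    digest = hashlib.sha256((md_file + file_contents).encode("utf-8"))
    return digest.hexdigest()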

The final step is to add the file contents into the ChromaDB index:

def add_to_collection(file_contents, md_file, fingerprint):
    collection.add(
        documents=[file_contents],
        metadatas=[
            {
                "filename": md_file
            }
        ],
        ids=[fingerprint] 
    )    

This is where ChromaDB’s simplicity shines. The documents part is where we pass the actual content to be stored. The metadatas part provides a way to tag each piece of data; we can use these metadata attributes later to filter data from the database. The ID uniquely identifies a given piece of content, and we can use it for updates or deletions later.
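
For instance, here is roughly how I'd run the indexer and then exercise those metadata filters and IDs. These are standard ChromaDB collection methods, though the filename value here is just an illustration:

index_wiki()  # run the indexing loop defined earlier

# Filter stored entries by a metadata attribute (illustrative filename)
entry = collection.get(where={"filename": "./hexwiki/onboarding.md"})

# Use the returned IDs for later updates or deletions
ids = entry["ids"][:1]
if ids:
    collection.update(ids=ids, documents=["updated contents"])
    collection.delete(ids=ids)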

Retrieving Similar Content from ChromaDB

Now, we are ready to fetch relevant content from our ChromaDB file. In my particular case, my team has a self-hosted GitLab, so I'll search for GitLab-related content. I'll restrict the results to a maximum of 5 items.

results = collection.query(
    query_texts=["gitlab"],
    n_results=5
)

Almost instantaneously, I get documents super relevant to my query.
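
The result is a dictionary of lists, with one inner list per query text. For example, to see which wiki pages matched and how close each one was:

# metadatas and distances are included in query results by default
for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
    print(meta["filename"], dist)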

The next logical step here is to send all these documents along with the user query to an LLM such as OpenAI’s GPT.

LLMs can synthesise a coherent and relevant answer based on the user's query and the retrieved documents.

Upcoming in Part 2: Get Chatbot Answers based on Vector DB documents

That’s it for Part 1 of my experience building a documentation chatbot for my team.

In the next part, I will explain how we went about getting an LLM to formulate relevant answers to our questions.

Stay tuned!