If you are watching this video, you are most likely familiar with the chat completion process using large language models, and in particular the OpenAI LLMs; in other words, the process we use to create a chatbot to conversationally query our data. You probably also know that the first step in that process is the ingestion phase, where we take our document, usually a PDF file, and convert it to a text file. If that text file happens to be too large to fit into the context window of our large language model, then we have to chop it into smaller chunks. So the next question is: what is the best way to chunk our data? There are a lot of Python scripts available which will chunk your text files by sentence, paragraph, size, and other factors. After looking at what was currently available, I decided to create my own methodology. I call this methodology semantic chunking.

I am working with large PDF documents that I wish to embed in order to be able to use them as context for chat completions. By embed, I mean upsert to a vector store, which I will query to retrieve context documents for my chat completion calls to the LLM. The problem here, of course, is that most large language models have a limit to their context windows, that is, the total size of text that you can submit in one prompt or query. The current limit for OpenAI's GPT-3.5-turbo model is 4,096 tokens. The limit for GPT-4 is 8,000 tokens, which is about 16 pages, soon to be expanded to 32,000 tokens, which is about 50 pages of text. This is all well and good, but the cost of GPT-4 8K is 6 cents per 1,000 tokens, and the estimated cost of GPT-4 32K is 12 cents per 1,000 tokens. When you're executing queries on 10, 20, 30 page documents at a time, this is going to get pretty expensive pretty fast. So I am currently planning to work with GPT-3.5-turbo for my chat completion projects for the foreseeable future. As I said, this model's token limit is 4,096 tokens, which is only about eight pages, but the cost per 1,000 tokens is $0.002. I can live with that. But how do I work around that eight-page limit when my documents are 20, 30, 100, or in one case even 700 pages long?

Most of the documents I am working with, legal contracts and agreements, regulatory code, policy manuals, research papers, and even some news articles and blog posts, all have an organizational structure. For example, a common regulatory code structure I see a lot is as follows: title, division, part, chapter, article, section. A news article or blog post may be broken down into a title, that's the main topic, and subtopics. A legal agreement might have this breakdown: chapter, sub-chapter, articles. The point being that there is usually some sort of hierarchy or structure to the information presented in each document, and usually this is a grouping of semantically similar ideas. I call this the semantic schema of the document. So my goal in chunking is to break a document down into its basic semantic ideas as represented in its semantic hierarchy, and I refer to this as semantic chunking.
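To make that a little more concrete, here is a minimal sketch of what a semantic schema might look like in code, assuming a regulatory-code style hierarchy. This is not my actual script; the level names, the regular expression patterns, and the choice of section as the base chunk element are placeholder assumptions, just enough to show the idea.

```python
import re

# A "semantic schema": the document's hierarchy from the broadest level down to
# the base element that will become one chunk. The level names and regex
# patterns below are placeholders, not the ones I actually use.
SEMANTIC_SCHEMA = [
    ("title",    re.compile(r"^TITLE\s+\d+", re.MULTILINE)),
    ("division", re.compile(r"^DIVISION\s+\d+", re.MULTILINE)),
    ("part",     re.compile(r"^PART\s+\d+", re.MULTILINE)),
    ("chapter",  re.compile(r"^CHAPTER\s+\d+", re.MULTILINE)),
    ("article",  re.compile(r"^ARTICLE\s+\d+", re.MULTILINE)),
    ("section",  re.compile(r"^SECTION\s+\d+", re.MULTILINE)),
]

BASE_ELEMENT = "section"  # the level whose text becomes one chunk


def split_on_level(text: str, level: str) -> list[str]:
    """Split plain text into one piece per heading of the chosen hierarchy level."""
    pattern = dict(SEMANTIC_SCHEMA)[level]
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts:
        return [text]  # no headings found: the whole text is one chunk
    return [
        text[start:end].strip()
        for start, end in zip(starts, starts[1:] + [len(text)])
    ]


if __name__ == "__main__":
    sample = "SECTION 1\nFirst semantic idea...\nSECTION 2\nSecond semantic idea..."
    for chunk in split_on_level(sample, BASE_ELEMENT):
        print(chunk)
        print("---")
```

In real documents the headings are messier than this, so the patterns end up doing most of the work, but the idea stays the same: pick the level of the hierarchy that carries one semantic idea, and cut there.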
I want to put together a question and answer knowledge base of California real estate law. This would be designed as a sort of tutor for people studying to take the California real estate exam, an advisor and assistant to existing Realtors and brokers, and a research assistant for legal experts in the field. All the documents I require can be found on the California Department of Real Estate website: you click on Complete List of Publications, then scroll down to the licensee and examinee publications. I am particularly interested in the Real Estate Law book and the Reference Book: A Real Estate Guide. If you click on the Reference Book link, it takes you to a page which lists the introduction and the 27 PDF chapters that make up the book. If I take a look at the first chapter, The Department of Real Estate (that is, the California Department of Real Estate), you can see that it is broken down into several sub-chapters: Government Regulation of Brokerage Transactions, and as we come on down, Original Real Estate Broker's License, then Corporate Real Estate License, then Original Salesperson License, and it just goes on like this.

I have written code that breaks each chapter down into sub-chapter chunks, and it places a header in each of these sub-chapter chunks. The header has a citation link to the actual sub-chapter, as you can see here. It also has a link to the chapter itself, which just refers back to the PDF on the California Department of Real Estate site. It then lists the hierarchy of where the sub-chapter belongs: for example, this chunk is part of the Reference Book, this happens to be chapter one, and this is the sub-chapter title, Government Regulation of Brokerage Transactions. I just continue to do this for each of the sub-chapters that belong in this particular chapter. In most cases, each of these chunks will meet my chunk size requirements. However, in the case of chunk text that exceeds the character limit I have set, I then chunk the chunk, making sure to include the same header in each sub-chunk, if you will. I have found that adding this header helps to maintain a contextual link between chunks of the same document.

In order to achieve this level of semantic chunking, you will need to: (1) determine the semantic hierarchy of your document; (2) determine the base semantic element you wish to serve as your chunk, in other words chapter, article, section, sub-chapter, etc.; (3) export your document to plain text format; and (4) implement code that chunks the text according to your specifications. Needless to say, the most difficult part of this is step four. It is going to require some knowledge of text parsing. Back in the 1990s I used to write Perl scripts to semantically mark up regulatory text documents for processing as Folio infobases, an early precursor to today's search-and-retrieve knowledge bases, so for me this is just old times. If you currently know how to code and use regular expressions, then you know what to do; I have also included a rough sketch of the idea at the end of this section. If you know how to code but aren't familiar with regular expressions, ask ChatGPT or Codex to generate code to break the document down as we have discussed here. If you don't know how to code, then you need to search and see if there is already existing code to do what you need, or find someone to do it for you, or learn how to do it yourself.

In your case, chunking by sentence or paragraph or even size may make the most sense. In my use case, chunking according to the semantic schema of the document is the best approach. There are other approaches, of course, and code that already exists to implement them. My goal here was simply to offer an alternative that I've not seen suggested elsewhere and that is currently working for me. Whatever you choose, I wish you the best. For me, this is one of the most exciting journeys I have ever been on in my life.
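As promised, here is a minimal sketch of the kind of code I have in mind for step four: split a chapter's plain text into sub-chapter chunks, prepend a citation header to each one, and re-chunk anything that is still too large while repeating the same header. Again, this is not my production script; the heading pattern, the character limit, and the header layout are placeholder assumptions.

```python
import re

# Placeholder values -- swap in your own limit, heading pattern, and links.
CHUNK_CHAR_LIMIT = 4000  # rough stand-in for "fits comfortably in the context window"

# Assumed sub-chapter heading style: a line containing only letters, spaces, and a
# few punctuation marks, with no terminal period. Real documents need a tuned pattern.
SUBCHAPTER_PATTERN = re.compile(r"^(?P<heading>[A-Z][A-Za-z ,'\-]+)\n", re.MULTILINE)


def build_header(book: str, chapter: str, subchapter: str, chapter_url: str) -> str:
    """Build the citation header repeated at the top of every chunk and sub-chunk."""
    return (
        f"Source: {book} > {chapter} > {subchapter}\n"
        f"Chapter PDF: {chapter_url}\n"
        "---\n"
    )


def split_subchapters(chapter_text: str) -> list[tuple[str, str]]:
    """Return (sub-chapter title, sub-chapter body) pairs for one chapter of plain text."""
    matches = list(SUBCHAPTER_PATTERN.finditer(chapter_text))
    pieces = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(chapter_text)
        pieces.append((m.group("heading").strip(), chapter_text[m.end():end].strip()))
    return pieces


def chunk_chapter(chapter_text: str, book: str, chapter: str, chapter_url: str) -> list[str]:
    """Chunk a chapter by sub-chapter; re-chunk oversized sub-chapters, repeating the header."""
    chunks = []
    for title, body in split_subchapters(chapter_text):
        header = build_header(book, chapter, title, chapter_url)
        budget = CHUNK_CHAR_LIMIT - len(header)
        if len(body) <= budget:
            chunks.append(header + body)
            continue
        # "Chunk the chunk": split the oversized body on paragraph boundaries and
        # prepend the same header to every sub-chunk to keep the contextual link.
        piece = ""
        for para in body.split("\n\n"):
            if piece and len(piece) + len(para) + 2 > budget:
                chunks.append(header + piece.strip())
                piece = ""
            piece += para + "\n\n"
        if piece.strip():
            chunks.append(header + piece.strip())
    return chunks
```

In my actual chunks the header also carries a citation link straight to the sub-chapter, not just to the chapter PDF; I have left that out here because how you build that link depends entirely on where your source documents live.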