Transcript for:
PDF Analysis Techniques and Applications

PDFs come in all shapes and sizes. You have PDFs with pure text, you have PDFs with a mixture of images and text, and then you have the PDFs from the 1960s that are scans, black and white, diagonal, hard to read, and those are the absolute nightmares even with GenRVI. In this video I'm going to demo how to use the brand new feature from Claude called PDF analysis which allows you to deal with very complex PDFs and be able to wrangle PDFs that have a mixture of either images and words or pure scans and images to be able to take advantage of documents that before were really hard to process but now they're fair game. For industries that are very blueprint heavy like construction or very manual heavy like in the healthcare space where you have a lot of studies that may or may not have existed as photocopies, this is the beginning of a game changer. And I'm going to show you how to use this feature in three different ways. One, through using Cloud's front end. Two, using a back end using Python. And three... how to actually use Replit to import this feature into Custom GPT if you don't want to switch providers. I'm going to start off with a more funny use case and work my way from there to get to some more business-oriented and practical ones. Let's dive right in. All right, so if you log in to Cloud, you'll see this feature preview pop up that tells you about three brand new features. One of them is their analysis tool, and the TLDR of this is pretty much, it's gotten way better at doing data visualization. I might do a video on that later on, but for today, We really care about this specific release which is called the visual PDFs feature. So you can see here it says gives Cloud 3.5 Sonnet the ability to analyze images, charts, and graphs in PDFs. And you can see here, like I was referring to, a very old manual from in it seems like the 70s that seems like a diagonal scan. So it can handle really complex PDFs and I battle tested it with one use case that really appealed to me. Now one of the banes of my existence is, you guessed it, building furniture, specifically IKEA furniture. And the reason is, a lot of the times when I go through the manual, I find myself needing to go on YouTube immediately to see how to actually build the thing. Now even though I'm not that handy, I'm usually pretty good at following instructions and sometimes these instructions are impossible. So believe it or not, when this feature came out, the very first thing I did was load one of these IKEA manuals in and ask it Can you explain this instruction manual to me like I'm a fifth grader? So let me show you how that works. All right, if we go into Claude here, and what I'm going to do is I'm going to pull my Claude PDF analysis, and I'm going to take this bed frame PDF, put it here. And one thing I want to make sure is that I have Claude Sonnet new toggled on instead of the other two. And if I send a request and say, Hey, can you explain how to... build this bed frame like I'm handy challenged and unable to follow basic instructions, basically like a child. It will go through and there's not much text there. It should be able to take basically screenshots of every single page and then decide is this text or is this image. If this is image I need to interpret it. And you can see here the very first thing. The pictures on page two are telling you use a screwdriver, don't work alone. And if you go back, let's bring this side by side just to make it easier. If you go to the second page, you'll see that's where this diagram is. So it read that perfectly. And then it goes to the actual building and it made it in my perfect language. See those six little rectangular things. This is exactly how I speak. They're like connectors, yada, yada, yada. And you can see here it breaks down everything. And my most favorite feature. is that if I go here and I go to one of these little items and I screenshot this and I copy it and I paste it back into Clawd and I say where do I use this screw? It should be able to go through the entire PDF, see which image is referring to the serial number or the number assigned to that screw and tell me which page it is. So it says here You'll notice in step three of the instructions and in the final diagram, these slats get inserted into the side rails of the bed frame. So if we go to, it says here page six, let's go to page six. All right, and it says 101 359. And if we go back to the actual picture, then we go, let me zoom out just a tad if I can. There we go. 101 359. So I was able to actually find the exact page. where this is used. So you can imagine if you're using the Claude mobile app and you took a picture of the screws on your carpet or floor and asked that hey where do I use this? This might save you tons of YouTube videos from 2005. So huge update for me. Probably not for you as a business owner, but still worth mentioning. Now, if we pivot to a more serious use case and we start a new chat here, and I go back to my files, if you see this one that I pulled, it's the United States Navy K-Type Airships Pilot's Manual. And you'll see this is literally from, can't find it directly, but if I had to guess, probably the 40s or 50s. So if we go down to a random page here, let's go to... this air pressure system. So I'm not gonna act like I'm this type of engineer, but we have air intake, we have air discharge, we have dampers, we have an air pressure system, air duct to gas space. So if I load this PDF in, and I, let me just try that again, there we go. All right, and we say, can you explain the air pressure, pressure system diagram to me like I'm not an engineer. Now this file, one thing I forgot to mention, is it's around 82 pages. The original file was 115 pages. One thing, because this feature is still in beta, that you should know is that it can't process files for now above 100 pages or are larger than 35 megabytes. So if it's larger than 35 megabytes, what I recommend is maybe converting that PDF into a compressed PDF. There's a lot of free tools you can use online for that, or possibly to a text file, which again you can do online pretty easily. Now if you have a file above 100 pages, what I recommend is that you maybe split it into 40, 50, or 40, 40, or do a couple groups of 30, just so you can actually process that whole PDF, maybe one request at a time. I imagine these are only restrictions that apply for now because it's in beta, and I would imagine when it comes out of beta, you'll maybe have some more breathing room to dump a whole book or a whole article or budget and have it actually compute it all in one shot. Now, if we send this request, it took around 30 seconds, and now it's going to break it down step by step. And one word I was looking out for here, a keyword that appealed to me was the word dampers. So if we go back to this page here, you'll see dampers, and I was very curious to see if that would pop up. And you'll see here, it goes through the air intake. It's breaking it down for me and then it says the pilot can control where the air goes using dampers like valves, yada yada yada. So if I ask it, which page is this diagram on? Curious if it can help you locate that. It originally says page 24. If I go to it, it says page 38. So that's necessarily no bueno. If I ask it, are you sure? In this case, it's asking me for advice. So maybe when it comes to the actual citation of these pages, it's not necessarily spot on, especially with larger documents. but it was able to actually find that diagram, which is technically all I care about. One more file I'll try on the front end is a completely different sector, which is the investment sector. So if I go to this PitchBook Venture Monitor report and I scroll down, you'll see that in some of these, if I zoom in, you have some executive summary graphs, line graphs. If you scroll through, this one's about deal making. So deal activity pulls back from quarter two. So we see you have a line graph here. These seem to be in the billions, the ideal value in the billions, deal count also in the billions, and you have these dots which is estimated deal count. So I'm going to fire up a brand new chat here and then go back to the PDF analysis and pull up PitchBook. This file is pretty meaty, so if I upload this you'll see it manages to upload. It's around 17 megabytes and if you remember our cutoff is 35 megabytes. So if I ask it a question, talk to me about the deal making trends so far in this report. If it's smart, it should hopefully be able to reference that chart about deal making and speak to the actual trend that's happening in some level of detail. It comes back with a mixture of observations that I can't necessarily tell which ones are from graphs and which ones are actually from the document itself. So one thing I'll battle test here, it'll say can you pull stats solely from graphs that you see related to deal making and break down those trends with some analysis of your own referencing the specific pages if you can. So again, on the pages front, because it's in beta, I think it's not fully there yet. But let's see if it can at least extrapolate and tell me from the trends what it's seeing. You'll see this time it says deal value and count trends on page seven. and if we go to page seven you'll see it's exactly the page i was referring to which is the deal making page let's see if it can do some analysis so it says the q3 deal value is 37 billion around from down 55 billion in Q2 of 2024. And if we go back here, we'll see that Q2 correlates to around 55 billion. And this correlates around to 37 billion. So we got that right. And that's referring to different pages. So we see deal size trends on page eight, and it's going through different medium deal sizes. So if we go down here, it's going through this actual graph. And you can see it's referencing medium VC deal values, which is awesome. So now we're in a place where we can not only understand the actual text in these documents, but it can actually go through and make some remarks and analysis regarding the actual visualizations and any underlying trends. So this was a demo using the front end of cloud, and now I'm going to actually use an API key to try to access this API of the PDF analysis programmatically via Python. One thing we're going to have to do to actually use the API programmatically, is just create an API key from Cloud. So what you can do is go to console.anthropic.com or there are other ways to get this through what's called Open Router. But just for simplicity, we're just going to go here and we're going to go to Get API Keys and then we'll generate one. So I've already generated one for the purpose of this video that I'll delete right after called PDF Test down here. And I've prepared a Google Collab workbook that anyone can use whether you're technically inclined or not. And all you have to do is swap a few things. All you have to do to actually run this code is enter your API key here, which I entered myself. And then the questions that you want to send to the API. Now, usually you can only send one request using their default code. But I basically refactored it so you can add multiple questions that they all ran in tandem. So you don't just do one question per PDF. You can rip a couple if you have them. And you'll see down here, I tried to use some HTML to format the responses so that you don't get this formulaic. long bracket that you have to scroll through to see the result. So if we go back up here, literally all you have to do is other than entering your API key and the question you want to ask. So let's just say, what is this document? Can you summarize what's in here? And all we have to do is click on runtime, run all, and then this will install Anthropic and the packages that we need to run this. And then this will prompt me to upload a PDF. You'll see here, it'll say choose files. So I'll go. And I will choose, let's say, this file on fridges. Basically, a fridges manual on how to set it up, maintain it, etc. So if I upload that, and you'll see here, it's done processing the first question, which was, what is this document? And one additional thing I added to the code was basically replaying back the question you asked, just so if you have four or five questions, you can keep track. So this says it's an installation and operation manual for the Meadows commercial refrigerators and freezers. yada yada yada and these are the models and if you go to the actual document you can see here it's a manual and these are all the underlying things here if you scroll through technically we would have asked it you know what is a compressor where is that located etc you could ask all kinds of things about these files especially when it comes to these schematics here so that would be fair game as well but we already saw how that works elsewhere if we go back it's answered the second question which is can you summarize what's in here Again, it says it's an installation guide and it goes through everything with the manual cover. So that's all you really need to do to actually test out this code. If you want access to this, you'll find it in the Gumroad in the description below and you can use it to your heart's content. Now for my third magic trick, I'm going to show you how you can now access this through a custom GPT through a custom middleware called Replit that I'm going to use to build a bridge between this Cloud API and a custom GPT. If you're asking, why would you even bother bringing this into a custom GPT and paying API costs? Well, for me. I don't necessarily like to change my entire workflow of the different GPTs we're using just to actually use one feature in Cloud. So I like to test it in Cloud and then bring it into Custom GPT via Bridge so I can make use of it and have it work with the other agents I've set up. If you're not familiar with how to use different agents in Custom GPT, I highly recommend you watch my last video and see how you can apply that. Before I give you a preview of how to actually recreate this Custom GPT on your end, I'll just show you a demo of actually using it. So if you click on analyze this PDF, I've made it so that I can basically bundle a URL of a PDF as well as the question you want, and it will programmatically access it using this API. So if we go to the construction use case I mentioned at the very beginning of the video, you'll see here I have a document that's a sample drawing for two different houses and all the blueprints and schematics, which prior to this API, for any clients that wanted us to process something like this, we have to set up a whole workflow. that will take a PDF, loop through, take a screenshot, process each screenshot with a prompt, sometimes even separate prompts, and then come back with responses to questions in a very convoluted... an expensive way and sometimes we'd have to layer on what's called OCR or optical character recognition to try to basically decipher all these tiny bits of text. In this case if I go to a random page here let's say page number six and you can see as I'm scrolling through it's lagging from how thick of a file it is. If I zoom into here it's a picture of what seems to be like a three-story house and it says here wall assembly w6d It's going through all the materials you need here. So two layers of half type inch, type X. Okay, I'm not a handyman. That's why I needed the IKEA guide. So if I go back to here and I just take this URL and I paste it here. And then I ask a question. I'm at Home Depot. I'm trying to do the wall assembly. Can you look at the file? and tell me exactly what materials I need specifically. I need to ask the person at cash what to buy. So this will prompt the action to pop up and I'm gonna click confirm here. And you'll see here, it was able to speak to the document and tell me these are the specific materials I'm looking for and if I ask it, can you tell me specifically what wall stud size do I need? I need some specificity here. Apologies for that spelling. It should now resend that request paired with that PDF behind the scenes. And then hopefully we should get back some more detailed responses. All right. And now it's coming back saying the wall assembly needs a two by six wall studs. And if we go here, I'm assuming that is correct. And I can't actually discern it from here, but if we ask it. something specific like what about the gypsum board I don't know what that is but if I say what about the gypsum board I had to click that confirm button again and if you want to minimize that happening you want to go to privacy settings here and then click on always allow again this won't always work but it'll help alleviate at least some of the cases where it pops up so didn't come back with too much specificity here but you get the main idea is that you can actually pair the URL with the question and try to see if you can actually find the answer within the document Now, if you're wondering how I built this, it was a bit more complicated than expected, but you get the benefit of getting all of my hard work without all the pain. So all I had to do was in EditGPT, I created this prompt that basically says, if you'd like to analyze a PDF, follow these steps, basically give me the link. And then one thing I added to their API, basically in my Replic code, is the ability to take shareable Google Drive links. Inherently, if you use Google Drive links or OneDrive links, there's all these redirects that happen. due to authentication that make it really hard to look at any PDF or any document for that matter hosted in those environments. So you were kind of restricted to using PDFs hosted in different environments. So if you went to the internet and said, you know, body building guide PDF, and then you found something like this that has PDF, something like this would work pretty well, assuming it's not 144 pages, maybe 99 pages. But I want to be able to be extra lazy and be able to upload a file to my Google Drive, just take that link and make it shareable. So that's what I'm specifying here. And then other than that, I'm just making it ask the user to ask a question paired with the URL. So it sends that request at the same time to my function that's called ask Claude question. If I close this, I made a conversation starter that says analyze this PDF. If I go to the custom action, this is all gibberish. all you have to care about is the fact that if you want to replicate this, all you have to do is take my replica code, deploy it, and then you'll have a URL, you can swap my URL for yours, and you'll be all good. If you dive in the code, you'll see we're referencing the newest model that just came out end of October. And then we have to basically specify that we're using the PDF API. And then everything else is pretty much gibberish. This part is literally what I set up to be able to accept Google Drive links. So the idea is if I take one of these PDFs and I drop them here, let's take the IKEA one and drop it. All I have to do is right click, go to share and just make it shareable like by view. So if I make it by viewer and I copy this link, I've set the code up so that in here, if I click analyze PDF and I put this and I say, how do I build this? You'll see it's now able to speak. to that URL from Google Drive, making this a lot easier because I was banging my head against the wall trying to find PDFs that had a URL that would actually gel with this. API. And you'll see here it works perfectly. It went through and it came through the materials exactly what I need to do etc. So this makes my life easier and now it's very easy for me to upload a bunch of reports or a brand new study about LLMs and quickly be able to discern all the schematics or diagrams that are in it. If you're curious behind the scenes what we're getting is a payload or a response like this. So this is the actual raw format when you ask a question it comes like this and obviously GPT is bringing it in and formatting it. to make it more easy on the eyes. Now if you want to replicate this code, I'm going to be able to share this link here. All you have to do is go into your version of Replit and I'll paste this here. And all you have to do is click on Fork. After you click on Fork, you just click here. And then the only thing you have to do is click Run. It will tell you you're missing a secret, which is your API key. So once you run this here and we go grab our API key. So if I go back to this, Last API key here. I think we put it in the Google Collab. So let's paste this and go back and paste it and then add secret. Now you'll see it's running okay. And when you hook this up and basically deploy it, you just have to click on deploy, auto scale. And then again, if you don't pay for their core plan, you won't be able to deploy it. So again, always double check, is this worth it for you to pay 25 bucks a month? Will you be using multiple services on Replit and then hooking that up to a custom GPT or wherever you want to hook it up? That is up to your prerogative, but if you want to go down this path, this is what you have to do. As I said before, you're going to get access to the Google Collab, the Replit code, and the prompts and schema I use in my custom GPT all in the gumroad in the description below. If you choose to support the channel, I'll love you for it. My prediction is that in three to six months, this API or many others as well we'll be able to have you drop a 500 to a thousand page pdf that's either from the 50s or the 90s or the 20s and whether there's images or not i'll be able to process it very quickly and probably very inexpensively but for a lot of use cases that were kind of locked away from you know the average user and the average business owner this api or this feature at least in cloud is to me the beginning of a game changer if you love content like this let me know in the comments down below it really helps the algorithm and it helps the video And I'll see you guys next time.