Python OCR and Vision AI Automation

Hello friends, welcome to the Python RPA Automation Series blog. In my very first video, I covered how to download and use the LLaMA models and model weights on an Ubuntu machine. In the second video, I covered the related documentation on how to do the same thing in a Windows environment.

I also showed you a use case: how you can use this model to monitor employee attendance and expenses. In this video today, I'm going to show you another very interesting use case I've been working on: using language models as OCR and vision AI for my documents. Let me browse to my GitHub repository, and I want to call out that all the links are included in the video description below.

Before we jump into the details of the implementation, let me show you a quick demo of this application. I will also give you an overview of what you are going to see, so that you can decide whether to watch more or skip this video.

Our overall objective is to build an inexpensive OCR and vision AI on some real-life, production-like data. These are the steps we are going to follow. Let's assume that one of your employees submits an expense sheet and uploads a receipt, or you receive a vendor invoice, or it could be any document. What do you want to do? You want to read the content of the document, and you want to fully automate the whole thing. So as soon as you receive that document or image, you want to call a script and read the content from that image.

Here I'm going to show a fun example. Suppose you want to read from a screenshot, and it could be a screenshot of anything. Just for fun, I'm reading the content of a page showing Apple's stock today, and I will show you how to build that. As you can see, this is a very complex document with a lot of information embedded in it. Please pay attention to this: it is a screenshot of a web page, so the stock prices and all sorts of numbers are out there. One of the reasons I'm using it is that there are so many numbers, and I want to test how LLaMA or ChatGPT figures out the exact ones.

Once we read the content of that image or web page, we build a dynamic prompt and pass it to your preferred language model, for example LLaMA or ChatGPT. You will be surprised to see that both of them were able to answer very accurately. For example, out of all those numbers I asked one question: "Respond in one word: what is the average volume of the stock in this text?" I was actually amazed to see that it accurately returned the exact number, something like 70 million, which was the average volume of the stock. I also asked it to prepare some kind of table from the text, and it was able to break all those numbers down and create a very nice key-value, JSON-like text, which you can obviously store in a database because JSON values are easy to read. That is what we are going to achieve in this video today. Enough with the demo; let's go to our code.

Now, since I've got your attention and you are still watching, let me formally introduce myself. My name is Amish Shukla, and I train neural networks on finance, supply chain, and healthcare data to predict useful patterns. Most of the work you see in my GitHub repository is the result of my effort to predict supply chain shortages, especially for healthcare during the pandemic. In this video today we are going to build an inexpensive OCR. I keep on using the word inexpensive.

I don't want to say cheap, because cheap sounds too cheap. Now, you might wonder why I should develop and work on another OCR and vision AI solution when there are thousands and thousands of options available in the market. Most of the offerings you see in the market are actually wrappers built around an open-source OCR package. In this video today I'm going to use the same OCR package and build my own OCR and vision AI library.

You may also point out that a lot of large organizations offer vision AI as well. Personally, I find those very expensive, and they are not trained on my real data. The use case I was showing you ran without any fine-tuning on an existing knowledge base. If you tune those models on your own knowledge base, meaning on your own documents, I bet you are going to see results that are 10 to 20 times, or maybe 100 times, better than those vision AIs.

Once you are using your in-house knowledge base to train these models, you can use this vision AI for different use cases. For example, you can use it as a document classifier, a dictionary, or a digital private signature, or for scanning confidential information such as PHI (private health or personal data) in your documents. There is a lot of secured information in contracts and expenses. Those are your organization's documents, and you don't want to throw them on the internet. That's why you can download the LLaMA model and train it on your in-house knowledge base, and I expect you are going to see much better results, maybe something like a thousand times better, than with any external vision AI.

So let me do a code walkthrough, assuming that you want to automate the entire process: as soon as a file is received, you want to take an action on it. Say somebody does an FTP transfer, or a file is uploaded by some other means; for example, one of your employees drops a file at your SFTP location. This is the Linux code you can use to put a file into a folder. Or, instead of receiving a file, maybe you take a screenshot of a page. As you can see, the code for that uses the Pillow library and Selenium; both are open source and absolutely free of cost. Let me show you an example. This is the URL, and you can change it to match your business requirement; here I'm just taking a screenshot of the Apple stock page.

If you want to go through the details of what this code does, I have covered the whole thing in the Python automation scripts in my previous videos. Please go through that video, where I discussed the entire source code line by line and how I built it. The code you're seeing today is just a copy of what I used earlier. Again, what this code does: it takes a URL, goes to that website, takes a screenshot, and saves the screenshot to a local PNG file. Let me execute this. You are not seeing the full picture here, but in the background it takes a picture of the entire web page and saves it to a file called apple.png.

Now you want to write another script: as soon as the file is dropped, you want to trigger your automation script. Again, this is code I discussed in my previous videos: how to write a file-drop check, for example as a cron job, that notices as soon as a file is uploaded to a folder, and how to execute a subsequent script based on that. Please go through that video, which walks through the entire code; I'm just copying that code here. To recap: as soon as the apple.png file is created in the download folder, I want to call another script. As you can see, this script checks that folder every 10 seconds.

Now I want to read the text from the image. I have already covered this in previous videos too; that's the reason I created the whole RPA automation series. I covered all of these steps line by line, and these are mini snippets of that Python code. Go to my RPA repository one more time and you will find a detailed walkthrough of what this code does. Long story short, it uses the pytesseract OCR package, and whatever content you pass into it, you pass as an image.
The OCR function takes the image and reads its content into text. As you can see, I defined a function here, and I'm passing the apple.png file to it. Let me call this function.
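A minimal sketch of what such an OCR function could look like, assuming pytesseract and Pillow are installed along with the Tesseract binary. The function name read_text_from_image and the optional ocr hook are hypothetical and are not taken from the repository; only pytesseract.image_to_string and Image.open are real library calls.

```python
from pathlib import Path


def read_text_from_image(image_path, ocr=None):
    """Return the text found in an image file.

    `ocr` is an optional callable that maps an image path to a string.
    When it is omitted, pytesseract.image_to_string is used, which needs
    the Tesseract binary installed on the machine.
    """
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No image found at {path}")

    if ocr is None:
        # Imported lazily so the module still loads without the OCR stack.
        import pytesseract
        from PIL import Image

        with Image.open(path) as image:
            return pytesseract.image_to_string(image).strip()

    # Hypothetical hook: lets you swap in another OCR engine or a test stub.
    return ocr(path).strip()
```

With the OCR stack installed, read_text_from_image("apple.png") would return the page text as one string. The hook is there only so the function can be tried without Tesseract.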

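Stepping back to the folder check that triggers this OCR step: the "check every 10 seconds" idea can be sketched with the standard library alone. This is an assumption about the general shape of such a script, not the code from the earlier videos; the name wait_for_file and its timeout and sleep parameters are hypothetical.

```python
import time
from pathlib import Path


def wait_for_file(path, interval=10.0, timeout=None, sleep=time.sleep):
    """Poll until `path` exists, then return it as a Path.

    Returns None if `timeout` seconds pass first. The default `interval`
    mirrors the 10-second check mentioned in the walkthrough.
    """
    target = Path(path)
    deadline = None if timeout is None else time.monotonic() + timeout
    while not target.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return None
        sleep(interval)  # wait before looking at the folder again
    return target
```

Once wait_for_file returns a path, the next script in the chain can be called on it. Polling is simple and portable; a cron job or a file-system event watcher would be alternatives, with different trade-offs.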
The OCR function takes the screenshot we created, reads that image, and captures the content of the file.

The next step: now that we have read the text content from that image or document, it's time to build the prompt. Whatever we captured in the text variable, meaning the content of the image, goes into a dynamic prompt. I'm keeping the prompt very simple here. I want to either ask one question or get an explanation, so I'm building two different prompts. The first one says: respond in one word, what is the average volume of the stock in this text?

Again, I covered this in a previous video: I define two different functions, one that calls ChatGPT and one that calls LLaMA. Please go through those videos; I covered all of these details earlier so that I can build upon them. As you can see, the functions are very simple. For LLaMA, if you download the model you will see a file called example_text_completion, and in it there is a variable called prompt. All I'm doing is replacing that with the content I just read. So I simply call that function, but with the prompt variable updated to the content I read from the file.

There are two types of prompts I'm testing with. Please be creative and create your own prompts. This is entirely about prompt engineering: the more information, including statistical information, you pass in your prompts and the more creative you are, the better the results are going to be. You know your data better, so you can ask more relevant questions. Here I'm asking for a one-word answer: what is the average volume, or the average price, or today's price, or the previous day's closing price of that stock in this text? Please play around with this. When I asked these questions, the answers were very accurate with both models.

Let me show a live demo on LLaMA, and I will show you ChatGPT in a minute. Let me define this function. Okay, now let me run it and print the result for the first prompt, where I'm asking one question.
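The dynamic prompt described above can be sketched as plain string formatting. This is an illustration only: the function name build_prompts, the max_chars guard, and the exact wording of the second prompt are assumptions. Only the one-word average-volume question comes from the walkthrough.

```python
def build_prompts(ocr_text, max_chars=4000):
    """Build the two prompts from the OCR text.

    `max_chars` is an assumed guard that trims very long OCR output so
    the prompt stays within a model's context window.
    """
    # Collapse the whitespace and line breaks that OCR tends to scatter.
    context = " ".join(ocr_text.split())[:max_chars]
    question = (
        "Respond in one word: what is the average volume of the stock "
        f"in this text?\n\nText: {context}"
    )
    # Assumed wording for the second, table-style prompt.
    table = (
        "List the values in this text as JSON key-value pairs."
        f"\n\nText: {context}"
    )
    return {"question": question, "table": table}
```

Either string can then be handed to whichever function calls LLaMA or ChatGPT. Keeping the prompt builder separate from the model call makes it easy to try different questions against the same OCR text.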

What is the average volume of the Apple stock in that entire text? It scans through the entire text and finds the exact value: the average volume was 70 million. Similarly, let me change the prompt to the second prompt text, where I'm asking LLaMA to print the details. If I execute this function and print the second result, it prints all the details we just received from the LLaMA function. Amazingly, it is already putting the data in a JSON, tabular kind of format, which I can use directly.

Let me also show you a demo in ChatGPT, because it has a web interface and is easier to see. As you can see, you can test it out: I pasted the entire data and said, respond in one word, and you can ask it to describe the data as well. ChatGPT was actually more accurate, but again, it depends on your machine configuration and what kind of models you are using.

All right, that's all I wanted to cover in this video. I hope you liked it. If you have any questions, please feel free to open an issue on my GitHub repository and I'll be happy to help. Please subscribe to my channel, and thanks for watching. Thank you.
