Generative AI Training Session Notes

Not yet. Okay, it's on. Yeah.

Okay. Yeah, sorry for that. Thanks everyone for coming.

I just decided to take a certification. Took me 28 minutes, but I passed, so it's nice. Yeah. So if you didn't pass yet, you should also try it. Okay, so again, thanks everyone for coming.

This is actually my fourth session. The initial ones were on Monday and Tuesday; it's called Generative AI Training. A lot of people have been asking for a repeat because the room was very small, so thanks a lot for coming.

Let me start by introducing myself. I'm Anastasia, part of the specialist team here at Databricks. I'm based out of Paris.

I've been working in AI for quite a long time, and prior to that I worked as a researcher. In big data, I've worked with meteorological data, so I also specialize in geospatial on top of AI, and I'm part of the subject matter team for machine learning and LLMs at Databricks. This is an intro session. There's going to be no code; the goal is to understand more about Gen AI, talk about use cases, and talk about the questions where we see people struggling. Probably all of you have already seen the keynote, but we're also going to touch on the ecosystem and the vision of Databricks, and where we see the differentiator between, let's say, pre-trained commercial models and the models that you can tune yourself.

Again, we're going to try to answer three simple questions. There are probably way more than those, so if you have any questions, I'm very happy to chat with you after the session. First of all, people often ask: is this a threat, and how do I prepare my organization for this type of question? Because of course some of us may be managers or directors, so how can I prepare my staff to understand whether this is a threat or not?

The second one is how Gen AI can help my business, or your business, gain a competitive advantage.

And then, how can you use your data securely? When you're using proprietary things like, for example, OpenAI or other tools, where does your data go, how do you secure it, and how do you understand what can happen with your data and what's being used? So there are a few goals that this presentation tries to cover.

Of course, again, that's mostly a baseline, but it should give you a good understanding of what to think about when you start building or using Gen AI. First of all, we'll try to understand why there's such hype today, because in reality it's been around for a very long time already, why people have started talking about Gen AI again, and what is in it for you.

Then we're going to try to understand how Gen AI can make your business successful and what we see from the field, basically what we see with our customers: how to identify those use cases and how to prioritize them. We're going to talk about this pyramid of importance, how to decide where to start and how to start, right? Because that's important.

We're also going to touch a bit on ethical issues and bias in these models, because it's very important to understand that it's not just a kind of black-box toy; it can also harm people. So that's going to be the agenda, with the talk divided into three parts. First, Gen AI basics, to understand what the magic under those models is.

There aren't going to be any formulas, but we're going to walk through what is under the hood of these super powerful toys that today we call Gen AI. Then we're going to talk about applications.

We'll try to highlight the most common applications that I see, give you a few examples, and tell you what I think is the low-hanging fruit, so that it's simpler to start and maybe gives you ideas for where you can start. Then we're going to talk about how to prepare for this adoption, or how we see people preparing. And then, as I said, hopefully I will have time, because this was actually a two-hour class that I had to cut to 40 minutes, to talk about some legal and ethical considerations, because that's also very important.

So what is Gen AI, for people who maybe are not from the tech field at all, maybe from a business unit? Initially we would talk about artificial intelligence: something where we try to mimic how humans think, to mimic that intelligence. Then we came to machine learning. Machine learning uses some data and tries to find patterns, mostly from tabular data, trying to find linear relationships and correlations.

Then we moved forward with deep learning networks. They try to mimic the neuron connections in our brain: they transform a lot of data and extract patterns. And this is where Gen AI stands: it sits under deep learning. Generative AI is basically deep learning that is a bit more complex and based on different mathematical functions than classical deep learning. So Gen AI, again, is a subfield of deep learning, and there are a lot of contexts where we have already seen Gen AI being used.

Okay, so if you own a phone, it probably has a lot of Gen AI inside it. The most popular ones are Siri, Alexa, or Google Assistant. All of these are Gen AI models.

One of the oldest Gen AI technologies that probably all of us use is Google Translate. So you've probably been using Gen AI for ages. But what changed? iPhone and Google have been there for quite a long time, but today everyone talks about Gen AI; "generative AI" is written all around this place. There have been changes that made those things accessible to us.

This is very important. Before, those technologies were very proprietary, and the data was very proprietary too. So it was very complicated for people to even get access and understanding.

We cannot just go and Google the source code of Siri, right? Since then, a lot of technology has been open-sourced. There have been a lot of studies and also a lot of datasets, because the difference between Gen AI and even deep learning is that deep learning requires data, but Gen AI requires way, way more data.

To learn all of these patterns, you need a lot of good datasets. It's not something you can manually clean in a few days; you really need proper sources that have been analyzed or used by other people. So this is the first pillar that appeared: large datasets and pre-trained state-of-the-art models. Then, computational power.

What we also need to understand is that when we trained a machine learning model, we could do it on a local laptop. Then deep learning came, and it was still possible to do it on a local laptop, because some of us have nice cards from NVIDIA and maybe some companies bought GPUs. But today with Gen AI, models like GPT-3 or GPT-4 cannot be trained on your local computer, not even on a few computers, so you really require a lot of computational power. And where do you look for this computational power today? Cloud providers. On top of having accessible GPUs or TPUs, you also need very particular types of instances.

You cannot just buy a random cheap GPU to train GPT-4 on it. You really need a GPU instance with the proper precision support that is optimized for those workloads. So again, access to those freely. Well, freely: you need to pay, but you still have access to them.

Having access to them also made it possible to train these things. And then open source software, such as Hugging Face. If you haven't used Hugging Face yet: whenever you start doing anything regarding LLMs, or even classic NLP such as sentiment analysis, you just go to Hugging Face. Hugging Face is a hub of datasets and models. Why would you look at Hugging Face first?

Because, for example, it can give you an understanding of the types of models. We're going to talk about it later, but you'll see there are literally dozens of them, and it's going to grow. So you need to understand which model corresponds to which use case.

What can you use? What do they cover? How many languages? All of these models work mostly on language, and of course images, but how do they work? And then the technology. The technology has been evolving, and one of the most crucial technologies that helped Gen AI come through is the transformer.

I'm not going to go deep into it, but I recommend you read the paper or at least watch some videos about the attention mechanism. That attention mechanism is basically what enabled this breakthrough. And then last but not least, a technology that is still R&D and really state of the art for some companies: reinforcement learning from human feedback.

Officially, that's what OpenAI claims to use under the hood, but they even say that it's very complicated and error-prone, which I read as "we're using it so that you pay, so don't use it yourself, so that we can get more money." I think it's going to evolve more and more soon, but it's true that today, if you want to know how it works or how to use it, it's not going to be easy, and you would spend time understanding all of those things.

The next step is that those models are getting smaller and smaller. What does that mean? It means they need to learn less and less. What does it mean for you?

Less time spent preparing datasets, less time waiting while they train, and less money spent, because less time equals less pay if you pay for GPUs. So they're getting smarter and smarter thanks to different techniques, and they're getting smaller, meaning faster to use, faster to fine-tune if required, and faster to access. So let's go and check a few use cases.

These are probably the most common ones. They've been there forever. One of the most common that we all know is Q&A. Chatbots have been around for literally ages; some have been smarter, some less smart, but they're getting better and better.

These are probably the most common use cases that companies want to implement inside their organizations today: content generation, question answering, virtual assistants, content personalization, language style transfer, storytelling, poetry, translation, and code generation and auto-completion. I think something interesting to cover is code generation and auto-completion.

On top of code generation and auto-completion, you can also think of migration. Let's say I'm a data scientist; I work as an ML engineer.

Sometimes my customers use Scala, and I personally have to say I don't really like Scala a lot, and there are also not a lot of people still using Scala on the market. A lot of companies pay other companies to migrate them from Scala to PySpark or something else. So, for example, you could train a model to teach it how to migrate from Scala to PySpark.

Those technologies are already there, and there are going to be more and more. Oh, sorry, I just wanted to give an example. This one is more about content creation.

I think a lot of people who work in art or news may be a bit afraid. I'll give an example. In my company, we sometimes write blog posts, and I recently published one. And I was struggling to start the introduction.

It's always complicated to start from nothing. So I just went to ChatGPT and asked it: can you please give me the most common technologies around serving and what's used on the market, and write me why it's so cool to use serving in real time? It wrote me some text, I looked at it, and I went and changed it.

So basically it gave me that spark to start faster. I think that's going to be used more and more, well, it's already used, by content moderators and content creators, just because it can be a great tool to start with. In any case, all of these models are trained on human knowledge and human interactions.

So a model probably cannot yet be better than humans, in my personal opinion, so there is still a lot of room for creation, but it's a great start for a baseline. The example here is the model being asked to generate a text that proves to you, in a funny way, that Gen AI is a very good technology. The sentence that I like a lot is:

"It's not just about flying cars and robot butlers. It's mind-blowing technology that can compose symphonies, craft TV jokes, and design cutting-edge fashion trends." So, yeah, it's been kind of creative, right?

And it also wrote it in a funny way. So it's important to understand that they are also very creative, and they can follow what you ask them. If you want it in a funny way, then that's what you get. The next thing we need to understand before using anything is LLMs versus foundation models.

Foundation models are those huge models, such as ChatGPT, with the GPT-4 that OpenAI uses under the hood. Then there's Bard from Google, which came recently. Then there is MPT-7B.

Recently, MosaicML released MPT-30B, which has been reported to be even better than GPT-3, which is a huge model. The difference you need to understand is that you can take a foundation model and just start using it. You don't really need to fine-tune it.

They are there to be used for everything. LLMs, on the other hand, can really vary.

They can be smaller or bigger, and really anything that learns based on these transformers can be called an LLM. So now, for people who maybe don't know what's under the hood, let's talk a bit about how it works. This is an example with a text input.

Here the text input is "Dolly is a language model that can help you generate text for a variety of tasks." So how would it work? Say you want to generate the text that comes after.

What those models do is generate the probability of the next word. That's it, actually. The model gives you probabilities for what comes next.
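To make the next-word idea concrete, here is a minimal sketch in Python. It is not how any particular model is implemented: the candidate words and their scores are invented for illustration, and a real model would score every token in its vocabulary. It only shows how raw scores become probabilities (a softmax) and how the most likely word is picked.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    # Subtracting the highest score keeps the exponentials numerically stable.
    exps = {word: math.exp(score - highest) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

# Hypothetical scores for candidate words following the prompt "Dolly".
# These numbers are made up; a trained model would produce them.
candidate_scores = {"is": 3.2, "can": 1.1, "the": 0.3, "banana": -2.0}

probs = softmax(candidate_scores)
next_word = max(probs, key=probs.get)

for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.3f}")
print("most likely next word:", next_word)
```

In this toy setup the word with the highest score also gets the highest probability, which is the word the model would most likely continue with.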

So how does it work under the hood? You can separate the whole process into three components. The first one is what we call encoding.

Computers don't understand text and they don't understand audio; they understand bits, zeros and ones. So you need to let the model and the computer know what's going on.

For this, you need to encode the input into something the algorithm can understand. This is the part we call tokenization. You need a tokenizer that takes the text and chunks it into tokens; this is the text that is highlighted.

Then those tokens, which are still words, are converted into numbers. The next step is to convert those numbers into something the model can learn from. In reality, the breakthrough in the technology came from embeddings. So what is an embedding?

An embedding is when you take one number, for example the first one, 35, and represent it as a big vector of a certain size. Each of those numbers is projected and represented as a big vector. The classic size is something like 768. So it's just going to be a huge vector in n-dimensional space.
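The encoding steps described here (text to tokens, tokens to numbers, numbers to vectors) can be sketched in a few lines of Python. This is a simplification under stated assumptions: real tokenizers split text into sub-word pieces rather than on whitespace, the vocabulary below is built from a single sentence, and the embedding values are random instead of learned. The 4-dimensional vectors stand in for the hundreds of dimensions a real model uses.

```python
import random

text = "Dolly is a language model"

# 1) Tokenize. Whitespace splitting is a stand-in for a real tokenizer.
tokens = text.lower().split()

# 2) Map each distinct token to an integer id (a toy vocabulary).
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# 3) Embedding table: one vector per id. A real model learns these values;
#    here they are random numbers with a fixed seed.
EMBED_DIM = 4
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]

# Looking up each id gives the matrix the model actually works with.
embedded = [embedding_table[token_id] for token_id in token_ids]

print(tokens)      # ['dolly', 'is', 'a', 'language', 'model']
print(token_ids)   # [1, 2, 0, 3, 4]
print(len(embedded), "vectors of size", EMBED_DIM)
```

The result is one row per token, which together form the matrix of numbers the model learns from.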

What the model is going to learn from is this huge matrix in n-dimensional space, just a lot of projected numbers. Then there are the pre-trained foundation models.

Those models contain attention mechanisms, and those attention mechanisms find similarities and patterns in order to predict the next probability. So the last step of the model is the predicted probability. But we're humans, so we don't care about probabilities, right?

We just want to get a word. That's where the decoder part comes in: once the model is done, we need to decode the output back into text. The example here is the probability of each word coming next after this text: "is" has a higher probability than anything else. So now you understand there are a lot of tokens, because we speak a lot, in various languages, plus you may want to add code and all of that. So what do those numbers stand for?

Those numbers stand for the parameters that the model learns, so you can fit a lot of things into them. An example: Falcon 7 billion or 40 billion. It's literally the number of parameters in the model. The bigger the number of parameters, the heavier the model in terms of memory, the longer it takes to train, and the longer it takes to get something out of it, because it has to go through all of these patterns.
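A rough back-of-the-envelope sketch shows why parameter count translates into memory. It assumes each parameter is stored as a 16-bit (2-byte) or 32-bit (4-byte) number and counts only the weights, so the real footprint during training or serving is larger. The function name is made up for this illustration.

```python
def model_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

for name, params in [("7 billion", 7e9), ("40 billion", 40e9)]:
    at_16_bit = model_memory_gb(params, bytes_per_parameter=2)
    at_32_bit = model_memory_gb(params, bytes_per_parameter=4)
    print(f"{name} parameters: ~{at_16_bit:.0f} GB at 16-bit, "
          f"~{at_32_bit:.0f} GB at 32-bit")
```

So a 7-billion-parameter model needs on the order of 14 GB just to hold its weights at 16-bit precision, and a 40-billion-parameter model about 80 GB.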

Each model may correspond to different use cases. This is important to understand. One of the first, and last but not least in my personal opinion, because I would still use it for very simple sentiment analysis, is BERT.

BERT has been used by probably almost everyone in every industry. It's still used and it's going to be used, because it's small, it's reliable, and you can fine-tune it anytime. BERT is a pretty simple model, and it's very good for classifying tokens or whole sentences. If you want to classify the sentiment of a tweet or a comment about your company, you would definitely use it. Then there are other models. For example, BLOOM was trained on 46 languages, so if you need a model trained on a lot of languages, you could try BLOOM. What is important to understand next is licensing.

Okay, so licensing. Not all of them are open for commercial use. What does that mean? It means that you cannot just take the model and go make money from it. So be careful about commercializing.

Databricks Dolly is there. On top of Dolly as a model, we also released the dataset. That means you can take it, enhance it, and reuse it for your business. That's also important, because you need as many datasets as possible. I think I already talked about the use cases.

The use cases I would like to start with are summarization and named entity recognition. Named entity recognition is used a lot in pharma, in bio studies, and also, for example, in content moderation. If you have comments coming to your company web page, and maybe you have regulations that require you to check whether there is some toxicity or whether a person or company has been mentioned, you need to predict what's in that text.

Right now, without it, you would do this manually. Named entity recognition helps automate it. As for summarization, I think some of my colleagues would say it's kind of tricky how you summarize stuff.

But you're going to see an example. It's very helpful when, for example, you have a lot of text and you just want to put it together under some bullet points. This is an example of summarization:

"What are the top five customer complaints based on the provided data?" The model analyzed all the customer review datasets and came up with five bullet points: shipping delays, product quality, customer service responsiveness, billing and payment errors, and order inaccuracies.

I think this is a great use case for showing where this will go soon. Soon you will just have a phone connected to, I don't know, your company database, and you'll just ask, for example, "What have our sales been for the past five days?"

I think that was actually an example in the keynote, and it's basically what's coming soon. On top of that, there's feedback analysis, and virtual assistants that will help you both internally in a company and externally. There are way more use cases. One use case that I had seen for a long time before ChatGPT came is automated customer responses. When we work with customers, there are definitely different levels of risk or priority in terms of answering.

Maybe if the problem is big, you directly send a promo code or a particular email. Those bots are very powerful for that, and a lot of companies have been doing it. I already gave you an example about code generation, or code that can be used for migration.

There are a lot of other things. One of the breakthroughs I'm hoping for is automated test writing and documentation, because that's definitely something I hate to do, so I would pay for someone to do it for me. So I'm definitely looking forward to technologies that write code documentation and automated tests, and convert code between different languages.

We already discussed that there are different models. Some of them are proprietary and some are not. There are positives and negatives in both, so let's discuss them.

What is proprietary? Usually it's something that's offered as a service. For example, with ChatGPT or Bard, you subscribe, you can pay for a subscription, and then you just send requests.

It can be an API or it can be a UI, and you just interact with it. Then there are the ones you fine-tune: these are more like off-the-shelf models. It's something you would probably train in your organization, and then it would be up to you to use some service to expose it to business users, to the whole organization, and to your customers.

Some of them are for commercial use and some for non-commercial use, while proprietary models are all for your use. This leads us to four bullet points that people ask about, where we think you should take care and think things through. First of all, privacy. When you're using a commercial model, where is your data?

When you send a request and ask something, where does it go? Does it keep your answers? I can tell you I use ChatGPT for personal usage, and when I go back, I always see whatever I asked.

So if someone hacked my account one day, they would know what I asked. Then the quality. How do you measure quality?

And is the quality of a big foundation model as good as that of a small model tailored to your use case? This is also important to take into account. Then the cost. When people think of cost, when you need to fine-tune a model, they say, oh, it's expensive because I need resources, I need people.

But what you need to understand is that with these pay-as-a-service offerings, you usually pay per request. If you've ever played with ChatGPT, you know you need to be smart about the way you use it, because when you ask it to do something, there may be a lot of cases where it doesn't work.

For example, you ask it to generate code. You take the code, you try it, and it doesn't work.

So you take that code back to ChatGPT and say, oh, that part doesn't work, this is the error I get. It goes back and tries to change it.

For each of these requests and iterations, you pay. Now imagine that some of your colleagues, or someone who breaks into your system, send thousands and thousands of questions to it. It's up to you to pay. So the question stands: even if you're using it, how do you keep track of the cost?
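One way to picture cost tracking for a pay-per-request service is a small budget tracker that records spend per user and refuses requests once a cap is hit. This is only a sketch: the class name, the flat price per request, and the cap are all assumptions for illustration, and real services typically bill by usage volume rather than a flat fee.

```python
from collections import defaultdict

class RequestBudget:
    """Tracks per-user spend and refuses requests once a cap is reached."""

    def __init__(self, price_per_request, max_spend_per_user):
        self.price = price_per_request      # hypothetical flat price
        self.cap = max_spend_per_user       # hypothetical spending limit
        self.spend = defaultdict(float)

    def allow(self, user):
        """Record the cost and return True if the user stays within the cap."""
        if self.spend[user] + self.price > self.cap:
            return False
        self.spend[user] += self.price
        return True

budget = RequestBudget(price_per_request=0.5, max_spend_per_user=2.0)

# Six attempts in a row: the first four fit in the budget, the rest are refused.
results = [budget.allow("alice") for _ in range(6)]
print(results)                 # [True, True, True, True, False, False]
print(budget.spend["alice"])   # 2.0
```

The same structure gives you both things at once: a record of who spent what, and a hard stop when someone sends thousands of requests.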

How can you monitor it, and can you limit the number of requests that people send? And then the latency. If you apply those models to customer service, even internally, but especially to customer service,

your customers probably won't wait for hours. Those models are very heavy. They can take hundreds of gigabytes, sometimes thousands, so it's not like you just take a small container, throw the model in, and it powers your chatbot. So how do you ensure that latency requirements are met?
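A simple way to check latency is to time a batch of calls and look at a high percentile rather than the average, since a few slow answers are what customers notice. The sketch below assumes a stand-in function `fake_model` in place of a real model endpoint (which is hypothetical here) and uses the nearest-rank definition of a percentile.

```python
import math
import time

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def measure_latencies(call, prompts):
    """Time each call and return the latencies in seconds."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies

def fake_model(prompt):
    # Stand-in for a real model endpoint; replace with an actual client call.
    return prompt.upper()

latencies = measure_latencies(fake_model, ["hello"] * 20)
p95 = percentile(latencies, 95)
print(f"p95 latency: {p95 * 1000:.3f} ms over {len(latencies)} calls")
```

Comparing the measured p95 against the response time your customers will tolerate tells you whether a given model and deployment are fast enough.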

How do you know that it's answering your customers quickly? Coming back to the pros and cons for both. Again, this is on average; we could probably all come up with twice as many of these pros and cons. But usually big models are smart.

They've been trained on terabytes and terabytes of data. They've read thousands of books. So they're smart.

They know a lot of things. But you need to know how to use them. Even if you deploy those models in your company, you still need to enable your team to understand how to ask those questions. And since ChatGPT is such a huge model, it may also take a long time to teach people how to use it.

Then it's about the cost. It's costly, okay? So you need to ask staff.

And then there is data privacy and security. Where is the data? Can it be leaked?

How secure is it? And then vendor lock-in. Imagine that in three to four months something better than OpenAI arrives, and it's cheaper and maybe better quality. What do you do?

This is also something to take into account. Then the pros and cons of open source models. The positive is that you can tailor them to a task. Even with a small model, you can really apply all your company knowledge, all the documentation that you have, to a particular model. The data stays on your side, so you don't need to share it with anyone. The negative part is that it requires an upfront time investment, and it requires data, because of garbage in, garbage out.

Nothing happens without data. Maybe it's not millions, but you need at least a thousand examples. And then the skill set: you still require some in-house expertise to create and deploy those models. For people who maybe don't know what fine-tuning is, I just wanted to explain the difference.

When you're using ChatGPT, the model is ready for you, so you're just asking questions. But even ChatGPT doesn't have any idea about your personal data or your company. If I ask ChatGPT today to generate a story about me, it will generate a story, but it's not really about me; it's just generated content. So in order to improve the model quality and bring that knowledge into the model, you need to fine-tune it, basically inject the new data. That's what we call fine-tuning: you take a foundation model and you get a fine-tuned model out of it.

It doesn't require as many computational resources because the model is already initialized. You just need to update the weights with your knowledge. So that costs less.
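The idea that starting from already-trained weights costs less can be illustrated with a deliberately tiny example. This is not an LLM: it is a one-parameter model `y = w * x` trained with plain gradient descent, and the "pretrained" weight and the domain data are invented. It only shows that, given the same small number of update steps, starting near a good solution gets closer than starting from zero.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, steps, lr=0.05):
    """Plain gradient descent on mean squared error for y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Toy "company" data where the true relationship is y = 2.5 * x.
domain_data = [(1.0, 2.5), (2.0, 5.0), (3.0, 7.5)]

pretrained_w = 2.0   # pretend this came from earlier large-scale training
scratch_w = 0.0      # an untrained model

fine_tuned = train(pretrained_w, domain_data, steps=3)
from_scratch = train(scratch_w, domain_data, steps=3)

print(f"fine-tuned:   w={fine_tuned:.3f}, error={mse(fine_tuned, domain_data):.4f}")
print(f"from scratch: w={from_scratch:.3f}, error={mse(from_scratch, domain_data):.4f}")
```

With the same three steps, the fine-tuned weight ends up much closer to 2.5 than the one trained from scratch, which is the intuition behind fine-tuning being cheaper.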

Those models are better fine-tuned on your data. Let's say you work in the legal domain or in a scientific domain: you need to train those models on the domain that corresponds to you. I already gave you the example of BERT models; you can find a variety such as FinBERT, GeoBERT, BioBERT.

There are a lot of domains that this model has been trained on. And this is an example.

So why do people care about those open source models? You could say, okay, I don't care, I just want to pay, so it's fine for me. But again, the technology is evolving. In less than five months, we've already got a lot of open source models.

For example, here it mentions MosaicML MPT-7B, but just a few weeks ago they released, sorry, the 30 billion one. Here, Falcon says that they have 75%. MPT-30B claims to surpass GPT-3, which is a very heavy model in terms of fine-tuning. So those commercially open models are coming more and more often, and there are going to be more and more technologies coming, so it's going to be more and more accessible.

What you need to understand next is that you will probably never use a single model; you will need more than one. I'll give you an example that I love a lot; it's the simplest but the most powerful one. You work in an international company with an international customer service, so you probably get emails in different languages. What I still see people doing is that when an email complaint comes in, they translate it manually. Each email is translated, then they try to analyze the complaints, they may store some bullet points, and they answer the complaint. What happens is that often less than half of the complaints can be properly treated and analyzed. So customers become angry, and maybe your manufacturing continues to produce something that's malfunctioning, et cetera. To fix that, you would create a few models.

The first model would detect the language and then translate the text to English, just because English has been one of the most common languages for training all of these, so there's probably more knowledge in those models. Then you would probably summarize or do some named entity recognition and try to understand the pattern.

When you understand it, you can classify it. And from the output of the classification, you can decide which type of text to generate. Or you may just reuse a proper template that you already have and inject some data into it.
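The chain of steps for the multilingual complaint example can be sketched as a small pipeline. Every function below is a hypothetical stand-in for a separate model (language detection, translation, entity extraction, classification); the keyword rules, the one-entry dictionary, and the templates are invented so the sketch runs on its own. In practice each stand-in would be replaced by a real model call.

```python
def detect_language(text):
    # Stand-in for a language detection model.
    return "fr" if "bonjour" in text.lower() else "en"

def translate_to_english(text, language):
    # Stand-in for a translation model, using a one-entry toy dictionary.
    toy_dictionary = {
        "Bonjour, ma commande est en retard.": "Hello, my order is late.",
    }
    return text if language == "en" else toy_dictionary.get(text, text)

def extract_entities(text):
    # Stand-in for named entity recognition.
    known_items = ["order", "invoice", "refund"]
    return [item for item in known_items if item in text.lower()]

def classify_complaint(text):
    # Stand-in for a classification model.
    lowered = text.lower()
    if "late" in lowered or "delay" in lowered:
        return "shipping_delay"
    if "charged" in lowered or "invoice" in lowered:
        return "billing_error"
    return "other"

TEMPLATES = {
    "shipping_delay": "We are sorry your {item} is delayed. We are looking into it.",
    "billing_error": "We are sorry about the problem with your {item}. Our billing team will contact you.",
    "other": "Thank you for your message about your {item}. We will get back to you.",
}

def handle_email(text):
    language = detect_language(text)
    english = translate_to_english(text, language)
    entities = extract_entities(english)
    category = classify_complaint(english)
    item = entities[0] if entities else "request"
    reply = TEMPLATES[category].format(item=item)
    return {"language": language, "category": category,
            "entities": entities, "reply": reply}

result = handle_email("Bonjour, ma commande est en retard.")
print(result["category"])
print(result["reply"])
```

Here the classification output selects a template and the extracted entity is injected into it, which is the "reuse a template and inject some data" variant; the alternative would be to call a generation model at the last step.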

So I already explained how you can end up using at least five models together. So imagine you need those five models together. And again, if you're using a commercial model, you would pay a lot of money, and maybe not all of them will even work for that. So you will still need to create a fine-tuned one, okay? It's important to understand that some commercial models, the use-as-a-service type like OpenAI, may be super good for certain use cases, but for other use cases you may still use a very tiny and simple model like BERT, or maybe something bigger, with a few billion parameters, that is super simple to fine-tune. And then you just don't need to pay a lot of money to send some classification requests.

You probably also saw the keynote, so I will not spend too much time on Databricks offerings. I just wanted to explain the things that you need to understand when you deploy anything around LLMs, which don't really relate to any particular company, right? It's just what we came up with in terms of, let's say, data platform architecture. So there are usually multiple questions that we hear, and I've been talking to a lot of customers. How do we customize those models, if I want to customize them? How do I ensure that the connection is secure? Then, how do I deploy them? Because that's not just a 50-megabyte model, and I cannot simply download it and put it in my Docker container or wherever I want. How are you sure that they're delivering good results and are not producing unethical or biased answers? That's also very important. And how do you integrate your data governance? Even if you use as-a-service models, you still need to govern the data because, well, you probably want to know what people are asking, and you probably also want to know what the model is answering.

So you still need to govern data, right? And if you fine-tune your model, then, again, you need to govern the data, because, well, there's a lot of legislation. Hopefully we'll have time to at least go through some of it. But if the company is asked to show what data you used to train the model, because maybe someone is claiming that you leaked their personal data, you need to have data governance, okay? So you still need to know what data the model used. And the issue with all of these as-a-service models is that they cannot guarantee you that they never used any data that relates to your customer information or something like that, because they just scraped data from the Internet. So they could also have used some LinkedIn information, all of that. And so, yeah, maintain the flexibility.

As I just showed you, there are already three or four models that are available. There are going to be thousands of them. So what you really need to understand is that you need an infrastructure where you can just plug and play those models.

You don't need to rebuild everything all the time when you have a new use case. So a few years ago, well, actually it's already been more than eight, Google came up with statistics where they measured how much time a data scientist spends on different tasks. And the statistics showed that ML engineers or data scientists at Google spend less than 10% of their time actually writing ML code. So everything around it, what's called infrastructure, takes 90% of people's time.

So this is where we need to understand that a model such as OpenAI's is just a model, okay? You can use whatever model you want.

But you still need to build all the infrastructure around it. There are a lot of things that have to be taken care of. There's your data platform, which provides your data governance and lineage. And you also want a data platform that works not only for LLMs but for other use cases, right? You still have other use cases: you have your BI, you have your ML, DL, whatever. There are still a lot of use cases, and you need a platform that fits them all. Then you need to prepare your data. What you need to understand is that for LLMs, it's not just the text, right?

The text has to be prepared into those embeddings. How do you store a vector of size 760, right? How do you store those multi-embeddings? So normally we store them in vector databases.

What is a vector database? Where do I get a vector database? Is it just the same as other vector databases? So, right, all of these products are evolving. So we need to understand that.
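To make the idea of storing and querying embeddings a little more concrete, here is a minimal in-memory sketch. It is not how any particular vector database works: the class name `TinyVectorStore`, the three-dimensional vectors, and the document ids are all made up for illustration, and a real system would use embeddings produced by a model and an index built for scale.

```python
import math

class TinyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self) -> None:
        # Each entry pairs a document id with its embedding vector.
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, embedding: list[float]) -> None:
        self._items.append((doc_id, embedding))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        # Cosine similarity: dot product divided by the product of the norms.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    def search(self, query: list[float], top_k: int = 1) -> list[str]:
        # Brute-force scan: rank every stored vector by similarity to the query.
        ranked = sorted(
            self._items,
            key=lambda item: self._cosine(query, item[1]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in ranked[:top_k]]

store = TinyVectorStore()
store.add("refund-policy", [0.9, 0.1, 0.0])
store.add("shipping-times", [0.1, 0.9, 0.0])
store.add("warranty", [0.7, 0.0, 0.7])
```

A query such as `store.search([1.0, 0.0, 0.0], top_k=2)` returns the ids of the two closest documents. The brute-force scan is fine for a handful of vectors, and the need to avoid it at scale is one reason dedicated vector databases exist.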

It's not just a general data platform anymore. Then, even if you don't fine-tune, you still need to evaluate the model. You need to monitor it.

You also need to monitor costs. So you need to be able to cap the number of requests that people are sending. And then you also need to connect your model to your data because, again, there are a lot of policies.

And so you need to be sure that those policies can be respected. And if someone asks, "Have you actually used my personal data to train models?", you need to know that that's not true. So you've probably already seen that slide, so I will not spend a lot of time on it.
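The earlier point about capping the number of requests can be sketched in a few lines. This is only an assumption about what such a control might look like, not a description of any product: the class `RequestBudget` and its per-user counter are hypothetical, and a real setup would also track tokens, time windows, and spend.

```python
from collections import defaultdict

class RequestBudget:
    """Caps the number of model requests per user (illustration only)."""

    def __init__(self, max_requests_per_user: int) -> None:
        self.max_requests_per_user = max_requests_per_user
        # Requests already consumed, keyed by user name.
        self._used = defaultdict(int)

    def try_consume(self, user: str) -> bool:
        # Refuse the request once the user has reached the cap.
        if self._used[user] >= self.max_requests_per_user:
            return False
        self._used[user] += 1
        return True

    def remaining(self, user: str) -> int:
        return self.max_requests_per_user - self._used[user]
```

The calling code would check `try_consume(user)` before forwarding a prompt to the model and return a friendly refusal when it comes back `False`.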

But for example, what Databricks is coming up with this year are things that either augment the capability of the platform to integrate LLMs or arrive as new products. So, for example, a vector index: a managed vector database that can help you enhance your data in real time and also store and share your embeddings, okay? That's very important, because your model will be powered by those. Then there's the marketplace, where we're going to share the best models for you, so that you don't have to go look for them yourself or download them. They're all going to be on the platform. And AI Gateway is the tool that's going to help you restrict those requests.

So, as I said, the vision is that you can use any model, okay? Whatever the model, the infrastructure is going to be the same. I will just go quickly through those slides because we have a limited amount of time. But what I wanted to show you again is that the models are just one single part, okay? Everything around them is still there.

You still need to serve them, you still need to monitor them. So one of the announcements today, which is going to be a very big breakthrough for the platform, is that we are unifying our feature store and model registry with Unity Catalog. So now it's going to be much simpler to link your model to your data in terms of governance, and also to your code and your parameters. So now you don't even need to move your model elsewhere, okay? Because in Unity Catalog you may have multiple catalogs that can be for development, staging, production, and all of those things. And so your model lives there, because your LLMOps will not be the same anymore, right? You're not going to retrain those models every week, et cetera.

Those models are trained less often. And then it's important to understand that you will not be able to deploy them the same way as a classical model. So how can you prepare for this revolution?

So act with urgency. Understand the fundamentals. Develop a strategy within your company, because it's important, right?

Neither Databricks, nor OpenAI, nor anyone else will develop your strategy, because you are the one who knows your business. You need to go talk to business users. You need to identify the highest-value use cases.

So this is something very important. Try to identify the use case that can be done efficiently and can also bring you real value, so that you can, let's say, get your investment back from it as soon as possible. So train the people, not only in terms of training models, but also in how to use them.

It's important so that people understand what these types of technologies can give them and where they can be used. So I'll give you a very quick example. I now use Google Workspace when I'm taking notes.

So for example, I have a call with my customer. And Google now has this option to import the meeting into my notes, and it gives me all the emails and all the participants. So whenever I finish my notes, I can just send them the email directly.

So now I don't have to go back, look for all of these names, write them in an email, et cetera. So these are the simple but very powerful accelerations and automations that can really speed up your business. So those are kind of the five pillars, right? So again, define the strategy.

It's very important to understand because some people would come and say, oh, we identified 500 use cases. Let's do them all. So that's not a great strategy.

Probably a better strategy would be: okay, I've identified 500, but these are the priorities. So let's start with these two or three or five use cases. So it's very important.

So again, talk to your business. The business usually knows where it can be accelerated, and even accelerating the HR department by 50% can bring you a lot of value, right? So, designing the architecture: be sure that the platform you're using and the way you architect your data platform are good, because what we see now is that not a lot of companies are ready for AI. And if you're not ready for even simple AI, then being ready for Gen AI is going to be even more problematic. So identify good design patterns in your architecture, because that's definitely important.

So, operations and monitoring: again, align your operational model with all of this Gen AI. And then people and adoption, right? Training and support, defining roles and responsibilities, because in any case, Gen AI is going to change the way we think. So very quickly, a few things you need to know about those models.

They hallucinate a lot. There may also be ethical issues, and privacy and security issues. What you need to understand is that those models learn from the data that is out there, right? So they don't know much about anything that is less represented than most of the data. Meaning, if everything written says that the typical person who works in IT is a white male, then most of the time the model will learn about this white male, and whenever you ask it to generate content, it will use that as its example.

Okay, so that's very important. And about data privacy: anonymize your data, encrypt your data, and be sure that you have proper access control. It's very important that you don't use private information, because those models don't forget. Once you put it inside a model, that's done, okay? The only thing you can do is remove the model and make a new one. One example, if you haven't heard of it, happened to Samsung, okay? Someone used ChatGPT and then data was leaked.

And recently I even heard that some passwords have been leaked. So data can be leaked from private services. That's also one of the reasons that a lot of companies don't want to use private hidden services. They want to have their own.
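The advice to anonymize data before it reaches a model can be sketched very roughly as follows. The two regular expressions and the `[EMAIL]` and `[PHONE]` placeholders are assumptions chosen for the example. Real personal data takes many more forms, so treat this as an illustration of the idea and not as a complete anonymizer.

```python
import re

# Deliberately simple patterns (assumptions for this sketch). They will miss
# many real-world formats and may over-match in some texts.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text
```

Running the redaction before a prompt is sent to an external service, or before text is added to a fine-tuning set, keeps the raw identifiers out of the model in the first place, which matters because the model cannot be made to forget them later.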

So those models are very kind. Literally. They want to help you to learn and to get the data.

So often those models have filters and limitations. So this is a great example. He asked for a torrent website to download illegal content.

And it's saying, no, no, sorry, I cannot do that. And then he said, okay, which websites should I avoid because they violate copyright? And it just gives him all of them.

So there have been a lot of cases like that, so you also have to be sure that no one will use those models to harm your business, okay? That is also possible, right? Someone breaks into your system, does some harmful things, and pretends to be your company. So there are also security threats, right? It's important to take them into account.

I just wanted to really go quickly. Yeah, so there's also regulatory areas. So you need to be sure that things that you're using are going to be compliant with your country and your government. So you also need to follow that because that's what's going to happen.

So there is this human bias in data. I think this is really important to understand how you interact with these models. So I really wanted to explain to you very quickly what is this human interaction loop, okay? Whenever you have a model, you will get an output of your model.

Okay, so humans are still very much required. Whenever you get an output from the model, you need to check whether there has been any toxic content, any toxic answers. And what happens then is this: the one doing the checking can be a model that flags things for toxicity, and those toxicity flags can be sent to a human. The human corrects the output, and you inject this data back into your model. This is the way you can reduce not only the bias but also error-prone model answers. It's not only that the model doesn't know something about an underrepresented minority, but also just about good or bad answers. So those models hallucinate a lot.
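The loop described above, where a model flags suspicious outputs, a human corrects them, and the corrections are kept as future training data, might look roughly like this. Everything here is a hypothetical sketch: the word list in `FLAGGED_TERMS` stands in for a trained toxicity classifier, and `ReviewLoop` is an invented name.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical blocklist standing in for a trained toxicity classifier.
FLAGGED_TERMS = {"idiot", "stupid"}

def looks_toxic(text: str) -> bool:
    # Compare lowercased words, stripped of basic punctuation, to the blocklist.
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & FLAGGED_TERMS)

@dataclass
class ReviewLoop:
    # Outputs waiting for a human reviewer.
    pending: list[str] = field(default_factory=list)
    # (flagged output, human correction) pairs, usable later as fine-tuning data.
    corrections: list[tuple[str, str]] = field(default_factory=list)

    def handle_output(self, model_output: str) -> Optional[str]:
        """Return the output if it passes the filter, else queue it for a human."""
        if looks_toxic(model_output):
            self.pending.append(model_output)
            return None
        return model_output

    def record_correction(self, bad_output: str, corrected: str) -> None:
        """Store the human's fix and remove the item from the review queue."""
        self.pending.remove(bad_output)
        self.corrections.append((bad_output, corrected))
```

The `corrections` list is the part that closes the loop: it is the data that would be fed back when the model is next fine-tuned.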

So you need to understand it. What is a hallucination? So, my favorite example.

So the first Ebola vaccine was approved by the FDA in 2019, five years after the initial outbreak in 2014. And then the output that the model generates is that the first Ebola vaccine was approved in 2021. So it generated something, but it has nothing to do with reality, right? Because the proper answer is 2019. What probably happened is that a lot was happening on the Internet in 2021 regarding Ebola, so most of what the model learned was about 2021. And so the model just decided, okay, that's the proper answer. And the next one is even worse. It's creating something that doesn't exist at all, right?

So: Alice won first prize in fencing last week. And then the model decides that Alice won first prize in fencing for the first time last week, and that she was also ecstatic. I don't know, maybe the model decided she was also beautiful. But basically this is also harmful generated content, and this is not something that should happen, right? So how do we address those ethical issues? First of all, again, slice your data and check your data, at least the input data, and update the data.

Then you can have models that detect toxic, discriminatory, and exclusionary content. So you can create filters based on those models. Then you can also fine-tune.

So you would fine-tune your model, and then all of this will lead you toward compliance with the regulations for your models. It's still important to understand that just a random model will not comply with them. So that has actually been all. As for next steps, Databricks published an edX class that's already available.

If you have access to Databricks Academy, you can also follow that class. The second one is going to be published at the end of summer. There are also a lot of blog posts and solution accelerators that Databricks has been publishing recently.

So don't hesitate to go check our blog page, our demos that are also available under the blog page, and the accelerators that show you the code and everything. And if you want to know more, watch the on-demand videos after DAIS. Okay, that's it. Thank you.

Sorry, I've been late for four minutes.

Thank you, Anastasia. Thank you so much. So thank you all, everybody, for coming.

And yeah, you guys did it. This is probably the last session of this summit. Thank you so much.

We will be trying to close up this room, so if you can please take your conversations outside the room, that will really help the folks who are managing this room to close up quickly. Thank you.
