Generative AI Training Session Notes

Jul 25, 2024

Introduction

Speaker: Anastasia, part of the specialist team at Databricks
Location: Paris
Background: AI researcher, expertise in big data and geospatial data
Certification: Recently passed a certification in 28 minutes
Session Goals:
- Understand Generative AI (Gen AI)
- Discuss use cases and challenges
- Overview of the Databricks vision and ecosystem related to AI

Goals of Presentation

Address concerns around Gen AI as a threat to organizations.
Explore how Gen AI can help businesses gain a competitive edge.
Understand data security when using proprietary tools.

Agenda Overview

Basics of Gen AI
Common applications of Gen AI
Preparation for adopting Gen AI
Ethical and legal considerations

Understanding Gen AI

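AI: Systems that mimic human thinking.
Machine Learning (ML): A subset of AI that analyzes data to find patterns.
Deep Learning (DL): A subset of ML that uses layered neural networks, loosely modeled on neuron connections, to transform and analyze larger sets of data.
Gen AI: An application of DL that generates new content (text, images, code) and requires vast training datasets.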

Historical Context

Technologies related to Gen AI have existed for a long time (e.g., voice assistants such as Siri and Google Assistant).
Recent advances in accessibility, data availability, and open-source technologies drive the current hype.

Computational Resources

Need for High Power: Training models like GPT-3/4 requires significant computational power, often provided by cloud services.
Open Source Software: Platforms like Hugging Face provide access to open datasets and models.

Use Cases of Gen AI

Common Applications:
- Chatbots and Q&A systems
- Content generation
- Personalized assistance
- Code generation and migration (e.g., from Scala to PySpark)

Content Creation Example

Use of ChatGPT for writing blog posts and generating content ideas.

Exploring Models

LLMs vs. Foundation Models:
- Foundation models (e.g., GPT-4) can be used directly without tuning.
- LLMs vary widely in scale and purpose, which influences how they are used.

Model Mechanics

Encoding: Input text is split into tokens and converted into numerical representations (via tokenization and embeddings).
Attention Mechanism: Key breakthrough that enables models to learn patterns and relationships between tokens.
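
Illustration (not from the session): a minimal Python sketch of the two steps above, under simplifying assumptions. The tokenizer is a plain whitespace split (real LLMs use subword tokenizers), the embedding vectors are random rather than learned, and the attention step uses the embeddings directly as queries, keys, and values instead of learned projections. All function names and the sample sentence are hypothetical.

```python
import math
import random


def tokenize(text):
    """Split text into lowercase whitespace tokens (real LLMs use subword tokenizers)."""
    return text.lower().split()


def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Assumption: no learned projection matrices, to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


EMBED_DIM = 4  # illustrative; real models use hundreds or thousands of dimensions
rng = random.Random(0)

tokens = tokenize("the cat sat on the mat")
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]

# One random vector per vocabulary entry; in a trained model these are learned.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]
embedded = [embedding_table[i] for i in token_ids]

contextual, weights = self_attention(embedded)
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```

Both occurrences of "the" share token id 0 and therefore the same embedding. Each row of weights shows how strongly one token attends to every other token in the sentence.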

Parameters of Models

Models with more parameters generally require more compute and memory, which increases both training time and inference time.
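
Illustration (not from the session): a rough back-of-the-envelope sketch of how parameter count translates into memory. It counts only the model weights, using the standard byte sizes for each numeric precision, and ignores activations, optimizer state, and caching, which add more on top. The 7-billion and 70-billion parameter sizes are hypothetical examples, not figures quoted by the speaker.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, precision="fp16"):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical model sizes, chosen only to show the scaling.
for label, num_params in [("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    for precision in ("fp32", "fp16", "int8"):
        size = weight_memory_gb(num_params, precision)
        print(f"{label} parameters at {precision}: about {size:.0f} GB")
```

Under these assumptions, a model with ten times the parameters needs ten times the memory for its weights, and halving the precision halves that footprint.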

Model Licensing and Governance

Difference Between Proprietary and Open Source Models:
- Proprietary: Commercially available, often with usage fees (e.g., ChatGPT).
- Open Source: Customizable and can be self-hosted so data stays in your own environment, but requires a time investment.

Ethical and Legal Considerations

Risks: Data privacy, security concerns, and potential for model bias.
Human Bias: Models may perpetuate biases present in training data.

Steps for Effective Deployment

Strategy Development: Identify priority use cases in collaboration with business users.
Operational Alignment: Ensure that your organizational model supports Gen AI integration.
Training: Equip staff with skills to effectively use Gen AI tools.

Practical Considerations

Models can hallucinate, producing incorrect or misleading outputs.
Human oversight and a feedback loop are important for monitoring output quality.
Data governance is crucial for maintaining compliance and protecting sensitive information.

Conclusion & Resources

Databricks Initiatives:
- New offerings to enhance LLM capabilities and governance features.
- Databricks Academy for more educational content.

Acknowledgments

Appreciation for participation and engagement in the session.
Closing remarks encouraging further discussion outside the room.
