Generative AI Training Session Notes

Jul 25, 2024

Introduction

  • Speaker: Anastasia, part of the specialist team at Databricks
  • Location: Paris
  • Background: AI researcher, expertise in big data and geospatial data
  • Certification: Recently passed a certification in 28 minutes
  • Session Goals:
    • Understand Generative AI (Gen AI)
    • Discuss use cases and challenges
    • Overview of the Databricks vision and ecosystem related to AI

Goals of Presentation

  1. Address concerns around Gen AI as a threat to organizations.
  2. Explore how Gen AI can help businesses gain a competitive edge.
  3. Understand data security when using proprietary tools.

Agenda Overview

  • Basics of Gen AI
  • Common applications of Gen AI
  • Preparation for adopting Gen AI
  • Ethical and legal considerations

Understanding Gen AI

  • Artificial Intelligence (AI): Systems that mimic aspects of human thinking.
  • Machine Learning (ML): Algorithms that analyze data to find patterns.
  • Deep Learning (DL): Neural networks loosely modeled on neuron connections, able to transform and analyze larger datasets.
  • Gen AI: An advanced form of DL that generates new content and requires vast training datasets.

Historical Context

  • The underlying AI technologies have existed for some time (e.g., assistants such as Siri and Google Assistant).
  • Recent advancements in accessibility, data availability, and open-source technologies drive current hype.

Computational Resources

  • Need for High Power: Training models like GPT-3/4 requires significant computational power, often provided by cloud services.
  • Open-Source Software: Frameworks and hubs such as Hugging Face provide access to pre-trained models and datasets (see the sketch below).
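
To make the Hugging Face point concrete, here is a minimal sketch of pulling a small open-source model from the Hub and generating text with it. The model name "distilgpt2" is an illustrative choice, not one recommended in the session.

```python
# Minimal sketch: load a small open-source model from the Hugging Face Hub
# and generate text with it. "distilgpt2" is an illustrative choice only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Generative AI helps businesses", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```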

Use Cases of Gen AI

  • Common Applications:
    • Chatbots and Q&A systems
    • Content generation
    • Personalized assistance
    • Code generation and migration (e.g., from Scala to PySpark)
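
As an illustration of the code-migration use case, here is a hypothetical before/after: a small Scala Spark aggregation (kept as a comment) and an equivalent rewrite in PySpark. The table and column names are invented for the example.

```python
# Hypothetical Scala-to-PySpark migration example (table/column names invented).
#
# Original Scala:
#   val df = spark.table("sales")
#   df.groupBy("region").agg(sum("amount").alias("total"))
#     .orderBy(desc("total")).show()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("sales")
(df.groupBy("region")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show())
```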

Content Creation Example

  • Use of ChatGPT for writing blog posts and generating content ideas.

Exploring Models

  • LLMs vs. Foundation Models:
    • Foundation models (e.g., GPT-4) are large pre-trained models that can often be used directly, without fine-tuning.
    • LLMs vary widely in scale and purpose, which influences how they are used.

Model Mechanics

  • Encoding: Input text is converted into tokens and then into a numerical representation (via tokenization and embeddings).
  • Attention Mechanism: The key breakthrough that lets models learn patterns and relationships between tokens across the input.
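
A toy, self-contained sketch of these mechanics, assuming a made-up four-word vocabulary, a random embedding table, and single-head scaled dot-product attention; the sizes and names are illustrative, not from the session.

```python
import numpy as np

# Toy "tokenizer": a hand-made vocabulary mapping words to integer ids.
vocab = {"generative": 0, "ai": 1, "creates": 2, "content": 3}
ids = np.array([vocab[w] for w in "generative ai creates content".split()])

# Embedding lookup: each token id becomes a dense vector.
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
x = embedding_table[ids]                      # shape (4, 8)

# Scaled dot-product attention (single head, no learned projections here).
scores = x @ x.T / np.sqrt(d_model)           # pairwise token similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
contextualized = weights @ x                  # each token mixes in its context
print(weights.round(2))
```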

Parameters of Models

  • Models with more parameters generally require more resources, which increases training time and inference cost.
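
A rough back-of-the-envelope illustration of why parameter count drives resource needs; the per-parameter byte sizes are standard for common precisions, and the 7-billion-parameter figure is just an example.

```python
# Approximate memory needed just to hold the weights (ignores activations,
# optimizer state, and caches, which add substantially more during training).
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9

params = 7e9  # e.g., a 7-billion-parameter model (illustrative)
print(f"fp32: {weight_memory_gb(params, 4):.0f} GB")  # ~28 GB
print(f"fp16: {weight_memory_gb(params, 2):.0f} GB")  # ~14 GB
print(f"int8: {weight_memory_gb(params, 1):.0f} GB")  # ~7 GB
```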

Model Licensing and Governance

  • Difference Between Proprietary and Open Source Models:
    • Proprietary: Commercially licensed, often with usage fees (e.g., the GPT models behind ChatGPT).
    • Open Source: Customizable and self-hosted, so data privacy is maintained, but a greater time investment is required.
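
To make the trade-off concrete, here is a hedged sketch contrasting the two routes: calling a proprietary hosted model versus running an open-source model locally. SDK details and model names vary by provider and version; the ones below are illustrative.

```python
# Proprietary route: no hosting or tuning on your side, but data leaves your
# environment and usage is billed per token. (openai SDK >= 1.0 style; the
# exact interface may differ across versions.)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize our Q3 sales report."}],
)
print(reply.choices[0].message.content)

# Open-source route: the weights run inside your own environment, so data
# stays in-house, but you own the hosting, tuning, and maintenance effort.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")  # illustrative
print(generator("Summarize our Q3 sales report.", max_new_tokens=40)[0]["generated_text"])
```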

Ethical and Legal Considerations

  • Risks: Data privacy, security concerns, and potential for model bias.
  • Human Bias: Models may perpetuate biases present in training data.

Steps for Effective Deployment

  • Strategy Development: Identify priority use cases in collaboration with business users.
  • Operational Alignment: Ensure that your organizational model supports Gen AI integration.
  • Training: Equip staff with skills to effectively use Gen AI tools.

Practical Considerations

  • Models can hallucinate, producing incorrect or misleading outputs.
  • Human oversight and a feedback loop are important for monitoring output quality (a minimal sketch follows this list).
  • Data governance is crucial to maintain compliance and protection of sensitive information.
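
A minimal sketch of the feedback loop mentioned above, assuming a simple CSV review log; every function and field name here is hypothetical, not a Databricks or session-provided API.

```python
import csv
from datetime import datetime, timezone

def log_for_review(prompt: str, response: str, reviewer_ok: bool,
                   notes: str = "", path: str = "genai_review_log.csv") -> None:
    """Append one prompt/response pair plus a human verdict to a review log."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            prompt,
            response,
            "approved" if reviewer_ok else "flagged",
            notes,
        ])

# Example: a reviewer flags a hallucinated answer so it can be analyzed later.
log_for_review(
    prompt="When was our Paris office opened?",
    response="It opened in 1987.",  # model output, possibly hallucinated
    reviewer_ok=False,
    notes="Date not found in any internal source.",
)
```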

Conclusion & Resources

  • Databricks Initiatives:
    • New offerings to enhance LLM capabilities and governance features.
    • Databricks Academy for more educational content.

Acknowledgments

  • Appreciation for participation and engagement in the session.
  • Closing remarks encouraging further discussion outside the room.
