Transcript for:
Comprehensive Guide to Database Concepts

First Test First Test Relational Databases and Normalization Definition and History of Databases * A database is an organized collection of data, typically stored and accessed electronically from a computer system. * The history of databases dates back to the 1940s with the development of early calc computers, ram, binary only. * Late 1950s-mid1960s files , flat files:no structure and hierarchical; limited ability to modify, time consuming * Network databases, leading to the emergence of relational databases in the 1970s. * Relational database- related tables of data E.F Codd developed relational database while working at IBM, contained in sets * Early databases were primarily file-based, which limited data retrieval and management capabilities. Relational databases revolutionized this by allowing data to be stored in tables that can be easily queried. Types of Databases * Files: Basic storage units that hold data without any structure or relationships. * Hierarchical Databases: Organize data in a tree-like structure, where each record has a single parent. * Network Databases: Allow more complex relationships with multiple parent-child relationships, resembling a graph structure. * Relational Databases: Use tables to represent data and relationships, enabling powerful querying capabilities through SQL. Normalization and Its Importance * Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. * It involves dividing large tables into smaller, related tables and defining relationships between them. * The main goal is to eliminate insert, update, and delete anomalies that can occur in non-normalized databases. * Relation- a table, row represents record, column = attribute or field Normal Forms * 1NF (First Normal Form): Ensures that all columns contain atomic values and each entry in a column is of the same type. Table is relation (table is a relation) * 2NF (Second Normal Form): Columns dependent on candidate keys * 3NF (Third Normal Form): all columns dependent on primary key * Insert- cant add data without introducing unnecessary info * Update- need to update info in multiple places * Delete-unintended loss of important info * Metadata- data about data * Entities- object of interest * Attributes- columns in databases Keys in Relational Databases * Candidate Key: A set of columns that uniquely identify a record in a table. * Primary Key: A candidate key chosen to uniquely identify rows in a table, cannot contain NULL values. * Foreign Key: An attribute (column) that creates a link between two tables, referencing the primary key of another table. * Alternate Key: any candidate key of a table that is not chosen as the primary key. Structured Query Language (SQL) Overview of SQL * SQL is a standardized programming language used for managing and manipulating relational databases. * It was developed in the early 1970s by IBM and has since become the industry standard for database management. * SQL allows users to perform various operations such as querying data, updating records, and managing database structures. SQL Syntax and Clauses * DDL (Data Definition Language): Used to define and manage all database objects (e.g., CREATE, ALTER, DROP). Deals with structure * DML (Data Manipulation Language): Used for managing data within schema objects (e.g., SELECT, INSERT, UPDATE, DELETE). deal with contents of the database * The SELECT statement is the primary command for querying data from a database. SQL SELECT Statement * The basic syntax of a SELECT statement includes clauses such as SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. * Example: SELECT column1, column2 FROM table_name WHERE condition; * Understanding the output of a SELECT statement involves knowing how to read and interpret the results returned by the query. SQL Functions and Clauses * Aggregate Functions: Functions that perform a calculation on a set of values and return a single value (e.g., COUNT, SUM, AVG). * GROUP BY Clause: Used to arrange identical data into groups, often used with aggregate functions. * ORDER BY Clause: Specifies the order in which the results are returned, either ascending or descending. * WHERE Clause: Filters records based on specified conditions, utilizing Boolean logic to combine multiple conditions. SQL Joins * Inner Join: returns only the rows where there is a match in both tables based on a common column. * Left Join Returns all rows from the left table, and matched rows from the right. If no match, right side will be NULL. * Right Join Returns all rows from the right table, and matched rows from the left. If no match, left side will be NULL. * Full/Outer Join Returns all rows when there is a match in either table. If no match, shows NULLs on the missing side. Syntax of data example: burth_date is not birth_date Boolean logic= or,and,etc syntax error= structure is wrong logic error= produces an outcome but it’s not the correct outcome (flaw logic) Sum = select sum (*) Sort = select * from blank order by blank like= helps you match patterns where blank like '%' no duplicates = select distinct blank count= select count (*) trim= use to find data regardless of spaces in between ex; where trim (description) like 'Footprint%' 2nd Test Fundamental Concepts of Relational Databases * Strong Entities: These are entities that can exist independently of other entities. They have a primary key that uniquely identifies each instance * Weak Entities: These entities cannot exist without a strong entity. They rely on a foreign key from a strong entity to ensure their existence. (intersection tables have weak entities) * Parent-Child Relationships: This refers to the hierarchical relationship between strong and weak entities, where the parent entity can exist independently, while the child entity cannot. * parent is primary keys child is foreign keys * Intersection Tables: These are used to represent many-to-many relationships between two entities. Has foreign keys to form it * Cardinality: This defines the number of instances of one entity that can or must be associated with each instance of another entity. * It can be one-to-one 1:1 Example-Each person has one passport, and each passport is issued to one person., * one-to-many 1:N Example A department can have multiple employees, but each employee works in only one department., * or many-to-many N:M Example A student can enroll in many courses, and a course can have many students enrolled. * maximum cardinality * maximum number of rows participating in a relationship * minimum cardinality * minimum number of rows participating in a relationship * non-zero = required cardinality Data Types and Locale Considerations * Common Data Types: Understanding the various data types is crucial for defining the structure of a database. Common types include: * Varchar- variable-length string data (255 bytes) (words) * Int: Integer data type for whole numbers * Float: often named number or decimals (125.74) * Money: Data type for currency values * Blob: Binary large object for storing binary data such as images/files * Locale: This refers to the geographical and cultural context that can affect data representation, such as date formats and currency symbols. Entity-Relationship Diagrams (ERDs) * ERDs: These diagrams visually represent the relationships between entities in a database. The Crow's Foot notation is commonly used to depict cardinality and participation constraints. One line thru is mandatory, 3 lines out is many, circle on line is optional * * Normalization: This is the process of organizing data to minimize redundancy and improve data integrity. It involves dividing large tables into smaller ones and defining relationships between them. Main goal to eliminate update, delete, and insert that occur in non-normalized data. Data Integrity, Efficient Querying, Ease of maintenance * Normal Forms: The process of normalization typically involves several stages: * 1NF (First Normal Form): Ensures that all columns contain atomic values and each entry is unique. Table is a relation * 2NF (Second Normal Form): Columns dependent on candidate keys * 3NF (Third Normal Form): Every attribute that is not part of the primary key is dependent on the primary key * Logical modeling focuses on the design and relationships of the data, is like designing a blueprint for your database. You're focused on what data you need and how it should be organized, but you don't care yet about the specific technology you're going to use ex; entity, attritbute * physical modeling is about implementing that design in a specific DBMS, considering performance and storage requirements. Where you take the blueprint from logical modeling and start making decisions about how to actually store that data in the database system (like MySQL, PostgreSQL, etc.). ex; table, column * Purpose: It's about turning your logical model into a real, working database. SQL Statements and Queries Basic SQL Operations * DML (data manipulation language): INSERT, UPDATE, DELETE: These commands are used to manipulate data within the database. * Insert: adds new records Same column amt ex: insert test (col1,col2) values (1, 'blah') * count mismatch= columns dont have same amt * Update: modifies existing records (w/o where clause updates all rows) ex; update test set col1=100, col2='too' where col1=1 * Delete: removes records (w/o where clause deletes all rows) ex; delete from test where col1=100 * Logically wrong: wrote query right but doesn't return useful information * Three-Way Joins: Involves joining three tables based on their relationships. * Ternary relationship: involves three entities, so the relationship table connects all three entities. Advanced SQL Techniques * UNION Clause: Combines the results of two or more statements, ensuring that duplicate records are removed. Columns must be same amount in order for union to work * Example: select first_name, last_name from students union select first_name, last_name from faculty Order by: Can only be used in what you listed ex: select make,model, order by make,model * Grouping: used to arrange identical data into groups (group by) 1,2,etc. * Aggregation: sum,count,avg,max,min are used to perform calculations on grouped data. Min (must), Max(can) * Subqueries: A query within another query. within the where clause An outer query and one or more inner queries * correlated (dependent on the outer query) KEYWORD: EXISTS * ex:SELECT e.name, e.salary FROM employees e WHERE e.salary > ( SELECT AVG(salary) FROM employees WHERE department_id = e.department_id * Employees column e.department_id was mentioned making it correlated * Non-correlated (independent, inner query doesnt reference outer query). KEYWORD: IN * ex; SELECT e.name, e.salary FROM employees e WHERE e.salary > ( SELECT AVG(salary) FROM employees ); * Employees column not mentioned therefore not correlated * In/Exists: These are used to filter results based on a set of values or to check for the existence of records in a subquery. * IN : in/not in (small) used in non correlated * EXISTS: exists/not exists exists keyword is often used with correlated subqueries to return set intersections – rows that match criteria (large) used in correlated Test 3 Database Structure and Integrity Database Tables and Relationships * Parent Tables: Parent tables contain primary keys (PK) that uniquely identify records * Child Tables: child tables reference these keys through foreign keys (FK). * Business Rules: These are the guidelines that dictate how data can be created, stored, and modified within the database, ensuring consistency and reliability. Database Administration * Database logging is the process of tracking changes and events in the database to ensure data consistency, recovery, and security. * Transaction Checkpoints: These are specific points in a transaction where the system saves the state, allowing for recovery in case of failure. * Rollback Mechanism (Rollback work): The rollback command is used to revert the database to its last stable state, ensuring data integrity in case of errors or failures. * Begin Work: command starts transaction * Commit Work finalizes it, making all changes permanent * Database Integrity Constraints * Constraints: Constraints are rules you set on columns in a table to control what kind of data can go into that column. They help ensure data accuracy and integrity. Ex:Makes sure a column can’t have empty values. * . Common types include primary key, foreign key, and check constraints. database adheres to specific standards, preventing invalid data from being inserted, updated, or deleted. * Constraint Purpose NOT NULL Prevents empty values UNIQUE No duplicate values PRIMARY KEY Uniquely identifies each row FOREIGN KEY Connects tables together CHECK Validates data with a condition DEFAULT Auto-assigns a value if none given * * MySQL Syntax for Constraints: the alter table command is used to modify existing tables, including adding constraints. * Default value:ALTER TABLE table_name * ADD column_name datatype DEFAULT default_value; * Primary Key Example: * CREATE TABLE table_name ( column_name datatype PRIMARY KEY); * CREATE TABLE Employees ( EmployeeID INT PRIMARY KEY, FirstName VARCHAR(50), LastName VARCHAR(50) ); * Foreign Key Example syntax includes: CREATE TABLE table_name ( column_name datatype, FOREIGN KEY (column_name) REFERENCES other_table(column_name)); * CREATE TABLE Orders ( OrderID INT PRIMARY KEY, CustomerID INT,FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)); * Domain(set) constraints: check a value against a set of values * Interval/Range constraints: check a value in a range of values * Referential/foreign key constraints: The foreign key constraint enforces this rule by preventing actions that would leave the foreign key column with a value that doesn’t exist in the referenced table. * Example of assigning primary key * CREATE TABLE Employees ( * EmployeeID INT PRIMARY KEY, * FirstName VARCHAR(50), * LastName VARCHAR(50) SQL Syntax Overview * Data Manipulation Language (DML): Insert, Update, Delete * Insert into (table) (col1,col2) values (1,'hmm') * Update (table) set col1=100, col2='talala' * Delete from (table) where col1=1 * * Data Definition Language (DDL) * .Create Table CREATE TABLE table_name ( column1 datatype, column2 datatype ); * LIKE: create table like- create table schema- NO DATA * -- create table t1 like db.class.t1 ( makes table the same but there is no data just same format) * AS: create table AS - creates table and copies data * -- create table t1 as select * from dbclass.t1 * Drop table: Drop table (table_name) or drop table IF EXISTS (table_name) * declaration of default values using alter table :ALTER TABLE Employees * ALTER COLUMN JoiningDate SET DEFAULT CURRENT_DATE; * Creating Views: Views are virtual tables based on the result set of a SQL query. They can simplify complex queries and enhance security. Use of database views:Views are often used to simplify complex queries by storing commonly used SELECT statements. This can make querying easier for users who need to perform repetitive tasks. * Example syntax: CREATE VIEW view_name AS SELECT column1, column2 FROM table_name; Database Concurrency Management Understanding Database Concurrency * Definition: Database concurrency refers to the ability of multiple users to access and modify the database simultaneously without interference. * Issues: Common concurrency issues include lost updates, dirty reads, and deadlocks, which can compromise data integrity and consistency, simultaneous access to same record, update errors, read errors. * Management Strategies: Full-featured relational databases implement various strategies to manage concurrency, including locking mechanisms and transaction isolation levels. * want database to have consistent valid state ACID Principles in Database Transactions * ACID: This acronym stands for Atomicity, Consistency, Isolation, and Durability, which are the foundational principles ensuring reliable transaction processing. * Application of ACID: Each principle plays a critical role in maintaining data integrity during concurrent transactions. Manage concurrency issues * Atomicity ensures that all parts of a transaction are completed successfully or none at all (if group fails all fails) * Consistency: transaction will move from database from one valid state to another * Isolation: executed sequentially, individual sessions can operate * Durability: committed transaction become permanent, can survive database (crashes r disruptions) * Transaction Isolation Levels: Different isolation levels resolve concurrency issues in ACID compliant databases(e.g., Read Committed, Serializable, Uncommited, Repeatable(phantom reads possible)) The value read by a transaction remains the same throughout the transaction, impacting concurrency. * serializable provide the highest data consistency, but they can result in a performance overhead. This is because the database might need to acquire locks on data to ensure no other transactions are concurrently modifying the same data. * In high-concurrency environments, Serializable isolation can lead to issues like deadlocks, where transactions wait for each other indefinitely. Serializable Isolation ensures data consistency by making transactions appear as if they were executed sequentially. It uses locking and validation to prevent conflicts, ensuring that the database remains accurate and reliable. Database Locking Mechanisms * Database Locks: Locks are mechanisms that prevent multiple transactions from conflicting with each other. * Types of Locks: shared locks (allowing read access) and exclusive locks (preventing access). Each type has implications for performance and concurrency. * Deadlocks: two users trying to access same resource and waiting for each other causing a standstill. Deadlock resolution strategies include timeout and wait-for graphs. * Granularity of locks: How much of the data is locked at one time (table, column, row, entire table,etc) * Optimistic Locking strategies: expect collisions to be infrequent, limited locking, detect & resolve collisions * NO CONFLICT preferred for internet * Pessimistic Locking strategies: collisions frequently, limit waiting for a lock on a contested resource, updates made safely CONFLICT * database transaction isolation issues and how they are commonly concurrency issues in ACID compliant databases * dirty read: not committed data reads * non-repeatable read The value read by a transaction can change between reads due to another transaction. * phantom read: reads data that has been previously read and finds rows that have been inserted by another transaction (ghost reads * transactions - as the term applies to concurrent/shared relational databases. In a concurrent environment, multiple transactions may be executed at the same time by different users or processes. The database system needs to ensure that these concurrent transactions do not lead to conflicts, inconsistencies, or violations of the ACID properties. Cursors in SQL * Cursors: Cursors are database objects used to retrieve and manipulate data row by row. They are essential for handling complex queries and operations that require iterative processing. * Types of Cursors: Forward cursors allow movement in one direction, while scrollable cursors permit movement in both directions, enhancing flexibility in data handling. New End of Year NoSQL Databases NoSQL Relational * Fixed Schema No Yes * Scalability Horizontal Vertical * Transaction Properties Not ACID (BASE) ACID * Standard Language none SQL Absolutely! Here's a simple explanation of all the key topics to help you prepare for your exam: ________________ ✅ Storage Buckets (Database Types) * Types * graph - composed of a collection of interconnected nodes * essentially a network database * columnar/column - * stores data by columns * key/value pairs Graph Databases What is it? A graph database stores data as nodes and relationships, like a web of connections. Real-World Analogy: Think of Facebook: * People are nodes * Friendships are relationships * You can ask: “Who are Alice’s friends?” or “Who are friends of her friends?” Example: If Alice knows Bob, and Bob knows Charlie, a graph database can quickly find that Alice indirectly knows Charlie. Columnar Databases (Column-Oriented) What is it? A columnar database stores data by columns instead of rows. This makes it super fast for analytics and reporting. Real-World Analogy: Imagine a spreadsheet: * Row-based = read every row one at a time * Column-based = read just the “Price” column for all rows — much faster if that’s all you need! * ________________ ✅ Data Management / Data Governance 🔹 Data Management * What is it? Organizing and maintaining data so it’s usable, secure, and accurate. * Why is it important? * Data is only valuable if it’s correct and accessible. * Helps in making better business decisions. * How is it done? * Data rules, roles, policies. * Using tools for backups, monitoring, cleaning, etc. 🔹 Data Quality (Good data should be...) 1. Accurate – Correct info. 2. Complete – No missing pieces. 3. Consistent – Same format across systems. 4. Timely – Up-to-date. 5. Valid – Matches required formats/rules. 6. Unique – No duplicates. 🔹 Master Data Management (MDM) * A single, consistent version of important data, like customer or product info. * Makes sure everyone uses the same trusted data across the business. ________________ ✅ Database Administration 🔹 Table Index and Performance * Index = like a book index. Helps find data faster. * Without indexes, the database has to check every row (slow!). * Helps improve query speed. 🔹 Performance and High Availability * Performance: how fast queries run. * High Availability: system stays up and running even if parts fail. * 24X7 - 24 hours a day, 7 days a week * five 9s - 99.999% uptime * Database backups and restores * full backups (all data) and incremental (since last full backup) * Database Clusters * Database replication/Mirroring * master (updatable) - slave (no updates/read only) * peer-to-peer - all instances updatable What is Database Replication / Mirroring? Replication or mirroring means copying data from one database to another, so that if one fails, the other can still work. Master-Slave (Primary-Replica) * Master: The main database where you can read and write (update) data. * Slave: A copy of the master, but it’s read-only (no changes allowed). * Many users reading data (from slave) while only the master handles updates. 2. Peer-to-Peer Replication * Every database instance (peer) is equal * All peers are updatable * Changes made in one peer are sent to the others How it works: * If Peer A updates a record, that change is sent to Peers B and C 🔹 Security * Users – People or apps who access data. * Can have permission to see only certain tables, columns, or rows. * Groups/Roles – A way to manage permissions for multiple users. * Example: All admins get full access, all interns get read-only. ________________ ✅ Big Data, Data Warehouse, NoSQL, Alternatives 🔹 Data Warehouse Data Warehousing is the process of collecting, storing, and managing large volumes of data from different sources in one central location, specifically designed for analysis and reporting. Simple Breakdown: * What it does: Combines data from many places (sales, marketing, finance). * Purpose: Makes it easier to run reports, spot trends, and make decisions. * Example: A company might use a data warehouse to analyze customer purchases over the past 5 years. * 🔹 Big Data * Datasets so large and complex that traditional tools can't handle them. 🔹 3 V’s of Big Data 1. Volume – A lot of data (terabytes, petabytes). 2. Variety – Many types of data (text, images, videos). 3. Velocity – Data is created and updated very quickly (real-time). 🔹 NoSQL (Not Only SQL) * A type of database for flexible, large, fast-changing data. * Doesn’t need fixed tables like traditional SQL. * Types: document, key/value, column, graph. Why do we use NoSQL? Because modern apps often need to: * Handle huge amounts of data (think: Facebook, Amazon) * Store different kinds of data (not just rows and columns) * Work fast, even with millions of users * Change quickly — like adding new fields without changing the whole structure * 🔹 Alternative Database Types * Hierarchical – Like a tree (parent-child relationships). Example: XML DB. * Graph – Stores relationships. Example: Neo4j. * LDAP (Lightweight Directory Access Protocol) – Used for directory services (e.g., storing user info for login systems). The relational model is a way to organize data in tables (called relations), where: * Each table has rows (records) and columns (attributes). * Every row is a unique item, and each column holds a type of data (like name, age, ID). * Relationships between tables are formed using keys (like IDs). * SQL Syntax * SQL syntax: * select statement - * joins (outer joins), * Inner * SELECT a.name, b.salary * FROM employees a * INNER JOIN salaries b ON a.id = b.emp_id; * Left/Right just change * SELECT a.name, b.salary * FROM employees a * LEFT OUTER JOIN salaries b ON a.id = b.emp_id; * * Subqueries: A query within another query. within the where clause An outer query and one or more inner queries * correlated (dependent on the outer query) KEYWORD: EXISTS * ex:SELECT e.name, e.salary FROM employees e WHERE e.salary > ( SELECT AVG(salary) FROM employees WHERE department_id = e.department_id * Employees column e.department_id was mentioned making it correlated * Non-correlated (independent, inner query doesnt reference outer query). KEYWORD: IN * ex; SELECT e.name, e.salary FROM employees e WHERE e.salary > ( SELECT AVG(salary) FROM employees ); * * ✅ Table Aliasing * Makes table names shorter in queries. * sql * CopyEdit * SELECT e.name, d.name * FROM employees AS e * JOIN departments AS d ON e.department_id = d.id; * * Insert into (table) (col1,col2) values (1,'hmm') * Update (table) set col1=100, col2='talala' * Delete from (table) where col1=1 View A view is a saved query that lets you "look at" a specific part of the data from that table. It's like creating a shortcut for the data you need most often. Why use a view? * Simplicity: If you have a complicated query that you run often, you can save it as a view and refer to it like it's a normal table. * Security: You can limit what users see. For example, you could create a view that shows only a subset of columns (like hiding sensitive data), and users can access the view without seeing everything. * Reusability: Instead of writing the same complex query over and over, you can create a view and just call it. CREATE VIEW view_name AS SELECT column1, column2, ... FROM table_name WHERE condition; CREATE VIEW EmployeeView AS SELECT EmployeeID, EmployeeName, Department FROM Employees WHERE Department = 'Sales'; Now, whenever you want to see the names and departments of sales employees, you just use: sql CopyEdit SELECT * FROM SalesEmployees; INDEX An index is like a shortcut for finding specific rows in a table. Think of it like the index of a book. Instead of flipping through every page to find a specific topic, you can go straight to the page number listed in the index. Why use an index? * Speed: If you're searching for rows based on certain columns often, an index makes those searches much faster. * Efficiency: Without an index, the database would have to look through every row in the table (a process called "full table scan"), which can take a long time for large tables. CREATE INDEX index_name ON table_name (column1, column2, ...); CREATE INDEX idx_employee_name ON Employees (EmployeeName); Key Differences Between Views and Indexes: * A view is like a shortcut to a query—it makes it easier to reuse complex queries and control what data users can access. * An index is a performance tool—it helps the database find data faster, especially when you're searching for specific values or rows. Test Questions 16. Which of the following is a characteristic or property of something of interest which could be modeled in a relational database? b.) attribute 29. Assume that the faculty table and the faculty_class table have a primary key-foreign key relationship through the faculty_id column. What would be the result of executing the following SQL statement? select * from faculty f join faculty_class fc c.) The statement would run and return an outer Cartesian product between the faculty table and the faculty_class table. Each row in the faculty table would be joined with all rows in the faculty_class table. This means it matches all data even because the on clause is not there so it does not just match the same ids 18. A student is assigned to a dorm for a semester. A student table stores a row of data about a student. A related dorm table contains a row of data for each dorm. What is the maximum cardinality between a row in the student table and a row in the dorm table? One student can be assigned to one dorm and one dorm can be assigned to one student a.) 1:1 20. What would be the results of running this query? select district_number, sum(sales_dollars) from sales group by district_number a.) The query would return rows from the table grouped by district_number with the sum of the sales_dollars for the district_number listed for that row 10. Which of the following is the correct MySQL syntax for adding a check constraint to the cats table which checks that the cat_age column has a value greater than 0? c.) alter table cats modify cat_age int add check ( cat_age > 0 ) Which of the following is a programmatic tool/feature which allows a programming language such as Python to use SQL to read database rows? d.) cursor 24. One issue with database locking is that it may b.) limit concurrency (throughput) in the database 35. Which of the following is the correct syntax for setting a default value of 'ACC' on a table named student in a column named major? a.) alter table student modify major varchar(20) default 'ACC';