Fundamentals of Working in Data: SQL (Part 1: Overview)
Introduction
If you want to work in data, knowing Structured Query Language, or SQL (pronounced "sequel"), is essential. Even though skills like Python, cloud architecture, machine learning, and AI are praised as highly valuable (which they are), understanding SQL is the most important first step, because it is the foundation of how professionals interact with data.
When I first got started, I thought, "SQL syntax is very simplistic. How much depth could there be, really?" As of writing this post, I've been working in data in some capacity for 7 years (Data Analyst, then Data Engineer), and I use SQL every day.
This post gives a basic overview of WHAT SQL is, WHY professionals use it, and HOW it works. I hope to instill the value of learning SQL first for people aspiring to become data professionals. Then I'll provide some recommended next steps.
What SQL Is (WHAT)
SQL is the standard language for managing and manipulating data in relational databases. It comes in different flavors (T-SQL, PL/SQL, SQLite, PostgreSQL, etc.), but they are fundamentally the same: a declarative programming language optimized for managing data.
SQL can find data (querying), modify data (INSERT, UPDATE, DELETE), and perform automated operations on that data (stored procedures).
Under the hood, the database engine takes the SQL you write and works out an efficient way to search and retrieve the data.
Don't let the simple syntax fool you. It hides a lot of power.
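To make the data-modifying statements mentioned above a bit more concrete, here is a minimal sketch. The table and its columns (including "item_id") are made up for illustration, and the syntax is written with SQLite in mind, so other flavors may need small tweaks.

```sql
-- Hypothetical table for illustration only.
DROP TABLE IF EXISTS groceries;
CREATE TABLE groceries (
    item_id          INTEGER PRIMARY KEY,
    product_category VARCHAR(50),
    department       VARCHAR(50),
    price            DECIMAL(6, 2),
    purchase_year    INTEGER
);

-- INSERT adds new rows.
INSERT INTO groceries (item_id, product_category, department, price, purchase_year)
VALUES (1, 'Apples',  'Produce', 3.00, 2023),
       (2, 'Bananas', 'Produce', 1.20, 2023);

-- UPDATE changes existing rows; WHERE limits which rows are touched.
UPDATE groceries
SET price = 1.50
WHERE item_id = 2;

-- DELETE removes rows; without a WHERE clause it would remove every row.
DELETE FROM groceries
WHERE item_id = 1;

-- Querying afterwards shows the one remaining row: Bananas, now priced at 1.50.
SELECT item_id, product_category, price
FROM groceries;
```

Notice that UPDATE and DELETE both rely on WHERE to pick their rows, which is the same filtering idea used when querying.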
How Professionals Use SQL (WHY)
Professional SQL work generally falls into one of these four categories. Below are high-level explanations of what each role is trying to solve:
Database Administrator and Data Engineer:
Role Objective: Establish and maintain data environments for efficient downstream use.
SQL Context: Manage and optimize data infrastructure, including database design, maintenance, and optimization.
Data Analyst:
Role Objective: Utilize querying to derive insights and create data visualizations.
SQL Context: Analyze data to derive insights that are used for data-driven decisions. This includes database querying, statistical analysis, and data visualization creation.
Data Scientist:
Role Objective: Develop and use machine learning models (algorithms) to derive insights.
SQL Context: Use SQL for data extraction and analysis to develop predictive models and uncover complex patterns in data.
Machine Learning Engineer:
Role Objective: Deploy and train machine learning models to automate tasks and scale predictions.
SQL Context: Use SQL for data preprocessing when developing and deploying machine learning solutions.
How SQL Works (HOW)
SQL Basics
-- This query retrieves the top 5 product categories
-- in the "Produce" department with the highest average prices
-- for items purchased in the year 2023.
SELECT product_category, AVG(price) AS avg_price
FROM groceries
WHERE purchase_year = 2023 AND department = 'Produce'
GROUP BY product_category
HAVING AVG(price) > 2.50
ORDER BY avg_price DESC
LIMIT 5;
SELECT
The clause that defines exactly which fields you want in the result set
Fields come from tables by default, but can also come from other sets (sub-queries, common table expressions, views, etc.)
- Example: Select the field "product_category" and the average of "price"
FROM
The clause that determines which set of data to use
- Example: Uses the "groceries" table
WHERE
The first filter: it removes every row that does not match its conditions
- Example: Keep only rows where "purchase_year" is 2023 and "department" is "Produce"
GROUP BY
Groups rows that share the same values in the listed non-aggregated fields, so that each aggregate function (such as AVG) is calculated once per group
- Example: Calculate the average price ("avg_price") for each "product_category"
HAVING
Just like the WHERE clause, except it runs after GROUP BY and filters the grouped results instead of individual rows
- Example: Filter out any group whose average price ("avg_price") is not greater than 2.50
ORDER BY
Sorts the result set based on the fields listed, in order. Sorts ascending (ASC) by default.
- From Example: Sorts the result set based on "avg_price" from highest to lowest (descending --> DESC).
LIMIT
Exclude all rows beyond the amount listed.
- Example: Show only the top 5 results
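If you'd like to run the query yourself, here is a small sketch with made-up sample rows. The data is hypothetical and I've assumed SQLite-style syntax. The numbered comments show the logical order in which the database evaluates the clauses, which differs from the order they are written in. HAVING repeats AVG(price) rather than using the alias, because not every flavor accepts an alias there.

```sql
-- Hypothetical sample data for the "groceries" table.
DROP TABLE IF EXISTS groceries;
CREATE TABLE groceries (
    product_category VARCHAR(50),
    department       VARCHAR(50),
    price            DECIMAL(6, 2),
    purchase_year    INTEGER
);

INSERT INTO groceries (product_category, department, price, purchase_year)
VALUES ('Apples',  'Produce', 3.00, 2023),
       ('Apples',  'Produce', 4.00, 2023),
       ('Berries', 'Produce', 5.00, 2023),
       ('Berries', 'Produce', 6.00, 2023),
       ('Bananas', 'Produce', 1.00, 2023),
       ('Bananas', 'Produce', 2.00, 2023),
       ('Apples',  'Produce', 9.00, 2022),  -- wrong year: removed by WHERE
       ('Milk',    'Dairy',   4.50, 2023);  -- wrong department: removed by WHERE

SELECT product_category, AVG(price) AS avg_price  -- 5. pick the output fields
FROM groceries                                    -- 1. choose the data set
WHERE purchase_year = 2023                        -- 2. filter individual rows
  AND department = 'Produce'
GROUP BY product_category                         -- 3. form the groups
HAVING AVG(price) > 2.50                          -- 4. filter the groups
ORDER BY avg_price DESC                           -- 6. sort what is left
LIMIT 5;                                          -- 7. cap the number of rows
```

With this sample data, the result is Berries (average 5.50) followed by Apples (average 3.50). Bananas average 1.50, so HAVING removes that group. The 2022 and Dairy rows never reach the grouping step, because WHERE removes them first.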
Data Retrieval (Querying)
From the above example, you can see how this is done. To go into more depth, querying is all about finding something specific and getting the result set.
Result Sets
The point of a query is to find the specific data you are looking for and ignore everything else. Retrieving all of the data every time serves no purpose. It's the same reason Google search exists: people want to find the information they need.
In the above example, we only need to know the top 5 produce categories by average price in 2023. Why would this be valuable?
If it were for a grocery store, this would provide insight into what to prioritize. Maybe they have to reallocate budget to other areas, but want to keep their highest-priced produce and prioritize getting it even when it's out of season.
But what if one data source is not enough? What if we wanted to compare that to other stores in the same franchise? We can use a JOIN to do this.
JOINs
JOINs are a way of joining two tables or result sets together. To answer the question from above, look at the query below with a join:
-- This query retrieves the top 5 product categories in the 'Produce'
-- department, across all stores in the 'Yummy Groceries Inc.' franchise,
-- with the highest average prices for items purchased in the year 2023.
SELECT g.product_category, AVG(g.price) AS avg_price
FROM groceries AS g
INNER JOIN stores AS s ON g.store_id = s.store_id
WHERE g.purchase_year = 2023
AND g.department = 'Produce'
AND s.franchise = 'Yummy Groceries Inc.'
GROUP BY g.product_category
HAVING AVG(g.price) > 2.50
ORDER BY avg_price DESC
LIMIT 5;
This is the same basic query as before, but with some extra considerations. Let's go over the basics of each new part, then elaborate.
FROM groceries AS g
We added an alias to our table. This is for convenience: we can use it as a shorthand reference from then on, instead of writing "groceries" every time.
INNER JOIN stores AS s
ON g.store_id = s.store_id
We added a new table called "stores" and gave it the alias "s". The "INNER JOIN" means we want to include records from both tables only when they satisfy the condition in the ON clause. In this case, only return records that have a matching "store_id" in both tables.
WHERE g.purchase_year = 2023
AND g.department = 'Produce'
AND s.franchise = 'Yummy Groceries Inc.'
Once the data is joined, we filter it further in the WHERE clause. From here we only keep records that have:
groceries purchase year of: 2023
groceries department of: Produce
store franchise of: Yummy Groceries Inc.
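To see the join in action, here is a sketch with made-up rows for both tables. The data, and the second franchise name "Other Foods Co.", are assumptions for illustration, and the syntax is written with SQLite in mind.

```sql
-- Hypothetical sample data for the "stores" and "groceries" tables.
DROP TABLE IF EXISTS groceries;
DROP TABLE IF EXISTS stores;

CREATE TABLE stores (
    store_id  INTEGER PRIMARY KEY,
    franchise VARCHAR(50)
);

CREATE TABLE groceries (
    store_id         INTEGER,
    product_category VARCHAR(50),
    department       VARCHAR(50),
    price            DECIMAL(6, 2),
    purchase_year    INTEGER
);

INSERT INTO stores (store_id, franchise)
VALUES (1, 'Yummy Groceries Inc.'),
       (2, 'Yummy Groceries Inc.'),
       (3, 'Other Foods Co.');

INSERT INTO groceries (store_id, product_category, department, price, purchase_year)
VALUES (1, 'Apples',  'Produce',  3.00, 2023),
       (2, 'Apples',  'Produce',  5.00, 2023),
       (1, 'Berries', 'Produce',  6.00, 2023),
       (2, 'Berries', 'Produce',  8.00, 2023),
       (3, 'Berries', 'Produce', 20.00, 2023),  -- other franchise: removed by WHERE
       (1, 'Bananas', 'Produce',  1.00, 2023);  -- average too low: removed by HAVING

SELECT g.product_category, AVG(g.price) AS avg_price
FROM groceries AS g
INNER JOIN stores AS s ON g.store_id = s.store_id  -- pair each purchase with its store
WHERE g.purchase_year = 2023
  AND g.department = 'Produce'
  AND s.franchise = 'Yummy Groceries Inc.'
GROUP BY g.product_category
HAVING AVG(g.price) > 2.50
ORDER BY avg_price DESC
LIMIT 5;
```

With this sample data, the result is Berries (average 7.00) followed by Apples (average 4.00). The 20.00 Berries row belongs to the other franchise, so it does not affect the average.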
JOINS - A Holistic Overview
Starting out, you should focus on INNER JOIN (JOIN) and LEFT OUTER JOIN (LEFT JOIN) because those are the most practical. However, I do want to mention all of them just for your information.
INNER JOIN (JOIN)
Definition: Only return rows that have matching values in both sets
Use-Case: See above example
LEFT OUTER JOIN (LEFT JOIN)
Definition: Return all rows from the set in the FROM clause, plus any matching rows from the set in the LEFT JOIN. Where there is no match, the joined set's fields are NULL
Use case: Listing all products from Grocery Store A alongside the ones that match in Grocery Store B.
- Could be used to determine which products are missing from Grocery Store B.
RIGHT OUTER JOIN (RIGHT JOIN)
Definition: Return all rows from the set in the RIGHT JOIN, plus any matching rows from the set in the FROM clause
- Note: This is less intuitive than LEFT JOIN and essentially does the same thing in reverse. Use it if you want to, but it's rarely needed, since any RIGHT JOIN can be rewritten as a LEFT JOIN by swapping the two sets.
FULL OUTER JOIN
Definition: Return all rows from both sets, whether or not there is a match. Where a row has no match, the other set's fields are NULL
Use cases: Handling missing data, comparing data from two sources, merging data from two sources
- This is handy for analysis, but it depends on the business problem
CROSS JOIN
Definition: Returns the Cartesian Product (all possible combinations) between both sets
Use case: Creating a lookup table that contains every combination, such as a list of grocery stores with all products.
- This is handy for analysis, but it depends on the business problem
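Here is a short sketch of the LEFT JOIN and CROSS JOIN use cases described above. The three tables and their contents are hypothetical, and the syntax assumes SQLite.

```sql
-- Hypothetical product lists for two stores, plus a small list of stores.
DROP TABLE IF EXISTS store_a_products;
DROP TABLE IF EXISTS store_b_products;
DROP TABLE IF EXISTS demo_stores;

CREATE TABLE store_a_products (product_name VARCHAR(50));
CREATE TABLE store_b_products (product_name VARCHAR(50));
CREATE TABLE demo_stores (store_name VARCHAR(50));

INSERT INTO store_a_products (product_name) VALUES ('Apples'), ('Berries'), ('Kale');
INSERT INTO store_b_products (product_name) VALUES ('Apples'), ('Kale'), ('Mangoes');
INSERT INTO demo_stores (store_name) VALUES ('Store A'), ('Store B');

-- LEFT JOIN: every Store A product, with Store B's match when one exists.
-- 'Berries' has no match, so store_b_product is NULL on that row.
SELECT a.product_name AS store_a_product,
       b.product_name AS store_b_product
FROM store_a_products AS a
LEFT JOIN store_b_products AS b ON a.product_name = b.product_name
ORDER BY a.product_name;

-- Keep only the unmatched rows to list what is missing from Store B.
-- With this sample data, only 'Berries' is returned.
SELECT a.product_name
FROM store_a_products AS a
LEFT JOIN store_b_products AS b ON a.product_name = b.product_name
WHERE b.product_name IS NULL;

-- CROSS JOIN: every store paired with every product (2 x 3 = 6 rows).
SELECT s.store_name, a.product_name
FROM demo_stores AS s
CROSS JOIN store_a_products AS a;
```

Note that 'Mangoes' never appears in the LEFT JOIN results, because it only exists in Store B. A FULL OUTER JOIN would include it as well, although support for that join varies by flavor.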
Next Steps
If you want to get started working in data, then I hope this convinced you how valuable learning SQL can be. So the question you may have is, "OK, but what do I do now?"
I can suggest the following:
Think about what exactly you want to learn data for. Think about what type of work you'd be interested in. Keep this in mind for later.
- Don't skip this step, but also don't get stuck here. You may not know yet, which is totally fine. You'll have to try things out for yourself and see what works. For now, just have a basic idea in mind for later.
Think about how you enjoy learning. This will dictate what kind of approach you'd want to take. The important thing is to understand the theory and apply it with exercises as practice.
Free Content: When you want to learn but don't want to pay a premium
pros: exercises with an online SQL compiler in the browser
cons: may lack the depth and breadth required
- May be a good starting point, but find additional exercises afterward
Paid Course/Subscription: When you like a video with exercises to do at your own pace
pros: has curated data instructors, exercises with an online SQL compiler in the browser, certificate of completion
cons: paid subscription, completion certificate is not a sign of mastery
- I did this course as a refresher, and I think it's fantastic.
Book: If you like going through a curated book with exercises
T-SQL Fundamentals by Itzik Ben-Gan
pros: comprehensive and holistic view of T-SQL (Microsoft SQL Server), includes exercises
cons: may be intimidating for some beginners, setting up your own environment required
- I recommend that people use this as a supplement to other methods. Great book for theory and understanding.
Return to the first step, and think about what interested you in the exercises. Head in that direction, based on what you want to do.
Good luck! I'm rooting for you.