SQL – the data awakens

Yesterday I finished “Intro to SQL – Querying And Managing Data” course on Khan Academy. Overall, it was a pretty good course. I wanted to make some notes on what I have learned in the past few days. I really like their method of teaching where they have a 5 minute video lecture and then interactive exercises. You can learn at your own pace. If you don’t understand something, you can pause, rewind and watch that section again.

For those that don’t know much about Khan Academy, I’d encourage you to watch this TED video where Salman Khan explain its philosophy and how he started it.

SQL course content

So what did I learn in the course and in some related videos on Youtube?

Firstly, SQL is a couple of things:

  1. Relational: It is a relational database management base (“RDBMS”). This means that there is relationships between the data. It has been setup with a specific schema (i.e structure). It can be queried to ask for the information in a table format.
  2. Can be split up into smaller tables: You can have all the information in one table. However, what if you have a million rows of data? If you had it in one table, it can make it slower to write to and read. So you can split the data into smaller tables to make it faster. Then join them together as needed using a JOIN query.
  3. The data stored is not readable by humans: The data is stored in a format that cannot be read by humans. Hence a SQL query is required to extract the information into a tabular format.
  4. Database is scalable: You can put into information into a spreadsheet. However, there is a limit to how much data you can hold and the number of rows. A SQL database is scalable. However, when you come up against web scale and distributed systems, SQL may not be suitable. Hence some firms have opted for NoSQL. More on this later!

Some SQL queries I learnt 

Here are most of the queries I learnt.

  • SELECT * FROM table query: This allows you to pull up all the data from the table
    • If the table name is “students” which contain a bunch of student data. eg. SELECT * FROM students
  • CREATE a table: How to create a new table. For example, we could create a table with the students in our class for user ID, first name, last name, email, age.
    • CREATE TABLE students (id INTEGER PRIMARY KEY, user_id NUMERIC, first_name TEXT, last_name TEXT, email TEXT, age NUMERIC)
    • INTEGER PRIMARY KEY: This is the unique identifier e.g. 1, 2, 3 that is attached sequentially to a record. It can be added using auto increments. Two records cannot have the same integer primary key.
  • How to select specific columns of data from the table to display (instead of the whole table)
    • e.g. SELECT first_name, email, age FROM students
  • WHERE query: how to add a rule to query data which meets a WHERE query.
    • e.g. SELECT first_name, email, age WHERE first_name = matthew
  • INSERT data: how to add a row of data into an existing table.
    • e.g. INSERT into STUDENTS (user_ID, first_name, last_name, email, age) VALUES (541023, matthew, ho, matthew@matthewho.com, 18).
    • Typically you want to insert new data or use UPDATE data, not delete data.
  • GROUP BY: How to group related information together. e.g. all classes taken by student with user ID 541023.
  • ORDER BY: How to order information in a table using a ORDER BY query. e.g. using DESC (descending order).

Joining multiple tables

There is a good post on Stackoverflow on why you would use one table with many columns or multiple tables.

Any time information is one-to-one (each user has one name and password), then it’s probably better to have it one table, since it reduces the number of joins the database will need to do to retrieve results. I think some databases have a limit on the number of columns per table, but I wouldn’t worry about it in normal cases, and you can always split it later if you need to.

If the data is one-to-many (each user has thousands of rows of usage info), then it should be split into separate tables to reduce duplicate data (duplicate data wastes storage space, cache space, and makes the database harder to maintain).

JOIN Queries

  • JOIN query: How to join multiple tables together using “JOIN” and “ON” query.
    • You need to figure out which tables to join and on a common attribute.
    • Suppose there are two tables, one containing student info and another containing which classes there are taking. So how can we match up the student user ID, first name, subjects they are studying?
    • You could use a universal value which is user_id across multiple tables to identify the student. A query would look something like this:

SELECT user_id, firstname, subjects

FROM students
JOIN classes
ON user_id

  • How to do LEFT OUTER JOINS to fill in information from the left column. Conversely the same with a RIGHT OUTER JOINS.

Combining multiple joins

  • AS: How to relabel a column e.g. one that combines information from multiple table to form a new column.
  • Alias: When you join information from the same table to itself, you have to give it an alias e.g. “A” and “B”.
  • Khan Academy uses an example called “Friendbook”. This is the exercise: “We’ve created a database for a friend networking site, with a table storing data on each person, a table on each person’s hobbies, and a table of friend connections between the people. Now, use SELECT with a JOIN to show the names of each pair of friends, based on the data in the friends table.”. You can view the problem here.
  • This is the answer below which uses multiple joins between 2 tables and alias to differentiate information from table that is used twice.

SELECT a.fullname AS friend1, b.fullname AS friend2
FROM friends
JOIN persons a
ON friends.person1_id = a.id
JOIN persons b
ON friends.person2_id = b.id;

More video resources on SQL

I found these two video to be really good on explaining what is SQL and how it is used.

Notes on NoSQL

I took a detour whilst doing my SQL course as I started coming across NoSQL and through the advice of my friends to check it out. I stumbled across this excellent video from Martin Fowler from Thoughtworks on NoSQL (see below). Here he discusses the history of databases and the common characteristics of NoSQL.

I thought it was pretty funny that the name “NoSQL” came from a hashtag for the first meetup for these kind of databases in San Francisco.

The difference with NoSQL is that its a different data model to SQL. Essentially, the model is non-relational. The database does not have relationships like in SQL. The information can be stored in an aggregate format.

Hence some of them are classified as “aggregate-orientated” NoSQL databases. These ones grouped under this classification include key value store, document, column family databases. The exception is a “graph database” in NoSQL.

The needs for NoSQL came about because of web scale. Companies like Amazon, Khan Academy, Google, Facebook needed a different type of database to handle the scale they were dealing with and also their infrastructure which uses clusters & distributed systems.

There’s a lot more to NoSQL, but that’s my understanding so far and the easiest way to explain it.

Since I had a basic understanding of SQL after going through the Khan Academy course and Youtube videos above, it gave me a good grounding to understand NoSQL. I’ll write some more on NoSQL in a separate post when I’ve got my head around it.

I’m out like the Star Wars movie,

Matt Ho.

Leave a Reply

Your email address will not be published.

CAPTCHA * Time limit is exhausted. Please reload CAPTCHA.