Inner vs Outer Joins on a Many-To-Many Relationship

Someone will probably tell me that this is an elementary-level topic, but I got some good questions regarding joins from my most recent ZCE class students, so I thought I'd put down the examples that I used to explain this to them. Being able to join with confidence is a key skill, because it means that you can refactor and normalise your data, without worrying about how hard something will be to retrieve.

The database structure I'm using here looks like this (the same example that I used when I wrote the Databases chapter for PHP Master):

diagram showing table relationships

If you want to grab the actual data I'm using for my examples, then here's the zipped mysqldump file: recipes-db.sql.

The database includes a "many-to-many" relationship; each recipe is linked to the ingredients needed for this particular dish, while each ingredient can of course appear in many dishes. To represent this, there is a recipes table, an ingredients table, and a recipe_ingredients table to show the combinations which apply. To join twice and understand whether you're getting all the rows or just the ones with matches in all the tables can be confusing, so this post is all about showing you some examples.

Fetching Matching Records From All the Tables

Let's say we want to see which ingredients are in which recipe. We'll do this in a number of steps, which I will show below, along with the query and result that goes with that step.

Get all the recipes

mysql> SELECT id, name FROM recipes;
+----+-------------------+
| id | name              |
+----+-------------------+
|  1 | Apple Crumble     |
|  4 | Beans Chili       |
|  5 | Chicken Casserole |
|  2 | Fruit Salad       |
|  3 | Weekday Risotto   |
+----+-------------------+
5 ROWS IN SET (0.00 sec)

OK, so we have some recipes, looking at the relationship diagram, we can see we need to link across to recipe_ingredients using the ID of each recipe.

All recipes, linked with recipe_ingredients

mysql> SELECT r.id AS recipe_id, r.name, ri.ingredient_id 
FROM recipes r 
JOIN recipe_ingredients ri ON (r.id = ri.recipe_id);
+-----------+---------------+---------------+
| recipe_id | name          | ingredient_id |
+-----------+---------------+---------------+
|         1 | Apple Crumble |             1 |
|         1 | Apple Crumble |             5 |
|         1 | Apple Crumble |             7 |
|         1 | Apple Crumble |             8 |
|         2 | Fruit Salad   |             6 |
|         2 | Fruit Salad   |             2 |
|         2 | Fruit Salad   |             1 |
|         2 | Fruit Salad   |             3 |
|         2 | Fruit Salad   |             4 |
+-----------+---------------+---------------+
9 ROWS IN SET (0.00 sec)

There are a couple of different things going on here. Firstly, I have aliased the table names to r and ri respectively. This is so that they are less in the way of us reading the important bits! Since there are multiple tables in the query now, it is important to "qualify" the table names - to say which table this column belongs to so that if there is more than one id column, it's obvious which one you meant.

Secondly, there is now a join between two tables. The query doesn't specify what kind of join it is, but the default join type is INNER JOIN. You might notice that not all of the recipes appear in this result set - only the ones where there are matching records in the recipes_ingredients table will match this query. To get all of the records, we would use an outer join - more on that later.

All the recipes, their ingredients, and the actual display information for those ingredients

To get more than the ingredient_id, we need to join on the ingredients table to get the details. This is a very common type of join, we'll use it when we're linking to pretty much any kind of detail record by its ID. Once again, it's an inner join, and it looks like this:

mysql> SELECT r.id AS recipe_id, r.name, ri.ingredient_id, i.item
FROM recipes r
JOIN recipe_ingredients ri ON (r.id = ri.recipe_id)
JOIN ingredients i ON (ri.ingredient_id = i.id);
+-----------+---------------+---------------+--------------+
| recipe_id | name          | ingredient_id | item         |
+-----------+---------------+---------------+--------------+
|         1 | Apple Crumble |             1 | apple        |
|         1 | Apple Crumble |             5 | flour        |
|         1 | Apple Crumble |             7 | butter       |
|         1 | Apple Crumble |             8 | sugar        |
|         2 | Fruit Salad   |             6 | fruit juice  |
|         2 | Fruit Salad   |             2 | banana       |
|         2 | Fruit Salad   |             1 | apple        |
|         2 | Fruit Salad   |             3 | kiwi fruit   |
|         2 | Fruit Salad   |             4 | strawberries |
+-----------+---------------+---------------+--------------+
9 ROWS IN SET (0.00 sec)

And there you have it - a many-to-many relationship join across three tables. Now you have this result, you can start adding whatever "where" clauses you need to get exactly the information you're after. This dataset includes chef and category, so you could search by all pudding recipes for example (my favourite!)

All the Records, Even Those Without Matches (Outer Join)

How about those "lost" recipes, the ones without details of their ingredients? You might want to include them in your results even if the chef didn't include ingredient details yet. To do this, we'll use an outer join.

The outer join brings in all the records, plus any matching ones. So in our example, we'll see all recipes, plus any ingredients records that match. If there are no ingredients that match, then we'll still get those records, but with a NULL in the ingredients fields.

Outer joins come in two flavours: LEFT and RIGHT. The way that I think of this is that the first table is the one on the left; if you want all the records from that first table, plus any matches from the second, you'll choose LEFT. If you want all the records from the second table, plus information from the first one if there is any, you'll choose RIGHT. Take a look at the diagram at the top of this post, and you'll see that we have the recipes table on the left of the diagram, and first in our query, so we'll be using a LEFT join here. (as a complete aside, I almost always have left joins, because I build up my queries that way around: getting the records I want, then the additional information to go with them, and so on. Maybe that's just the way my brain works?). Here's our recipes and ingredients again, but including the recipes without ingredients:

mysql> SELECT r.id AS recipe_id, r.name, ri.ingredient_id, i.item 
FROM recipes r 
LEFT JOIN recipe_ingredients ri ON (r.id = ri.recipe_id)
LEFT JOIN ingredients i ON (ri.ingredient_id = i.id);
+-----------+-------------------+---------------+--------------+
| recipe_id | name              | ingredient_id | item         |
+-----------+-------------------+---------------+--------------+
|         1 | Apple Crumble     |             1 | apple        |
|         1 | Apple Crumble     |             5 | flour        |
|         1 | Apple Crumble     |             7 | butter       |
|         1 | Apple Crumble     |             8 | sugar        |
|         4 | Beans Chili       |          NULL | NULL         |
|         5 | Chicken Casserole |          NULL | NULL         |
|         2 | Fruit Salad       |             6 | fruit juice  |
|         2 | Fruit Salad       |             2 | banana       |
|         2 | Fruit Salad       |             1 | apple        |
|         2 | Fruit Salad       |             3 | kiwi fruit   |
|         2 | Fruit Salad       |             4 | strawberries |
|         3 | Weekday Risotto   |          NULL | NULL         |
+-----------+-------------------+---------------+--------------+
12 ROWS IN SET (0.00 sec)

Since we're working on a many-to-many relationship here, we're joining three tables in total. This means two joins: one between the first table and the linking table, and another from the result of that query to the other end of the relationship. In this example, this means we had two left joins. Each join operates between two data sets, which are the resultset you have so far, and the new table that is being joined. Since we still want to have records from the recipe/recipe_ingredients join even if there isn't a matching record in ingredients, then the second join here also needs to be an outer join.

Joining All Tables

My initial diagram showed five tables in all, so here's the query and result for pulling all that data together:

mysql> SELECT r.name AS recipe, c.name AS category, ch.name AS chef, i.item AS ingredient
FROM recipes r 
LEFT JOIN recipe_ingredients ri ON (r.id = ri.recipe_id)
LEFT JOIN ingredients i ON (ri.ingredient_id = i.id)
INNER JOIN chefs ch ON (r.chef_id = ch.id)
INNER JOIN categories c ON (r.category_id = c.id);
+-------------------+----------+-------+--------------+
| recipe            | category | chef  | ingredient   |
+-------------------+----------+-------+--------------+
| Weekday Risotto   | Main     | Lorna | NULL         |
| Beans Chili       | Main     | Lorna | NULL         |
| Chicken Casserole | Main     | Lorna | NULL         |
| Apple Crumble     | Pudding  | Lorna | apple        |
| Apple Crumble     | Pudding  | Lorna | flour        |
| Apple Crumble     | Pudding  | Lorna | butter       |
| Apple Crumble     | Pudding  | Lorna | sugar        |
| Fruit Salad       | Pudding  | Lorna | fruit juice  |
| Fruit Salad       | Pudding  | Lorna | banana       |
| Fruit Salad       | Pudding  | Lorna | apple        |
| Fruit Salad       | Pudding  | Lorna | kiwi fruit   |
| Fruit Salad       | Pudding  | Lorna | strawberries |
+-------------------+----------+-------+--------------+
12 ROWS IN SET (0.00 sec)

Once you have this dataset, you can filter it down any way you want to. Since the chef and category elements are required fields, I used inner joins for those; left joins would have given the same results though on this dataset.

If you were looking for examples of inner and outer joins across a real (if simple!) database with multiple joins, I hope this helped. All this talking about food is making me hungry, I'm off to the kitchen :)

25 thoughts on “Inner vs Outer Joins on a Many-To-Many Relationship

  1. Jacques, you can add WHERE clauses after the JOINs, so like if you only wanted it for the "Pudding" category, you'd do something like this in the last example ( I don't know what the id for pudding is):

    ...
    INNER JOIN categories c ON (r.category_id = c.id)
    WHERE c.id = 2;

    This will give you just the recipe info in that category.

  2. Pingback: Inner vs Outer Joins on a Many-To-Many... | MySQL | Syngu

  3. Jacques: apart from being able to add a where after the joins, you might be interested to know that the "ON" in the join clause is essentially a where, too. So you can also say "LEFT JOIN foo ON bar.bar_id = foo.bar_id AND active = true";

    Lorna; although I agree that in some cases, adding an alias to a query is a good idea, I disagree in this instance. I personally find "categories.name" much easier to read than "c.name". As SQL queries are generally write-once, read many, I think being verbose is actually helpful. In the rest; very good article! I wish I'd run into such an article years ago ;)

  4. Take the last example: how would refine it to one row per recipe , but still have all the ingredients associated with that recipe e.g. In comma separated list?

    • Grab just the recipe you want by adding a WHERE clause on either recipe name or id. The comma-separated bit I would probably do at the application level, but MySQL does have a group_concat function that you can use with group by to return results with comma-separated values

      • If you do it at application level you have an extra query. It would interesting to know the pros and cons and performance analysis of the different approaches to dealing with that.

        For example, if you are listing recipes (10/page), and each recipe is to be listed along with its ingredients: grouping and concatinating the ingredients in one query will surely be better, because if you were to do that at application level it would be eleven queries.

        I guess what I'm talking about is what is a tagging schema:
        http://stackoverflow.com/questions/664283/scalable-database-tagging-schema

      • Would you be so kind as to show an example of GROUP_CONCAT for your recipe database? I've looked at the official documentation for it, and am still at a bit of a loss as how to use it properly.

        • Sure, the example here just needs you to wrap the group_concat around the ingredients.items column, then group by the recipe. Try this:

          SELECT r.name, group_concat(i.item) AS ingredients
          FROM recipes r 
          LEFT JOIN recipe_ingredients ri ON (r.id = ri.recipe_id)
          LEFT JOIN ingredients i ON (ri.ingredient_id = i.id)
          GROUP BY r.name

          Hope that helps!

  5. Pingback: SQL Joins with On or Using | LornaJane

  6. I have a many-many with an interesting wrinkle: in 90% of the cases, it's 1:1. Is it foolish to partially de-normalize in such a case?

    Example: a "harvest" is performed by no less than one person, but sometimes there are multiple people involved, and we want to record that fact and be able to use it. So my "harvests" table has a "who" field that is a foreign key into the "persons" table, and a "harvest-person" table that only contains entries for when there are two or more people involved with the harvest.

    This makes the simple case of showing who the lead harvester was very simple, but it complicates the case of listing all harvesters.

    I mentally justify this denormalization by noting that the "lead" harvester is fundamentally different than the "other" harvesters, but perhaps I'm just deceiving myself. (The "harvest-person" table would need an "isLead" boolean, anyway.)

    Thoughts?

    • Thank you Lorna for your didactic explanation on the joins.

      Jan, very interesting your example and questions. I have a question for you: Does a lead harvester exist as a fact or is just a name that you used in the case of having many harvesters in a harvest?

  7. OMG, this is the clearest explanation I've ever read of what a LEFT and RIGHT OUTER JOIN are. Epiphany!! Thank you.

  8. Lorna, I have a doubt. How do I get the recipes that have two or more specific ingredients? For example, all the recipes that use apple and sugar as ingredients. I was trying to figure out how to do that, but I think my SQL knowledge is not enough. I really appreciate if you could help me. Thanks.

    • You're right, you can't do this with joins alone. What you want here is an aggregate function, where you select all the recipes and join on to all the ingredients in those recipes, then use something to perform calculations on the *result* of the join. The manual page you want is probably this one: https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html and you should "group by" the recipe and add the list of ingredients as multiple "having" clauses. Hopefully that gets you started.

      • Actually, you can with joins:
        [code]SELECT r.id AS recipe_id, r.name, ri.ingredient_id, i.item
        FROM recipes r
        JOIN recipe_ingredients ri ON (r.id = ri.recipe_id)
        JOIN ingredients i ON (ri.ingredient_id = i.id)
        JOIN recipe_ingredients ri2 ON (r.id = ri2.recipe_id)
        JOIN ingredients i2 ON (ri2.ingredient_id = i2.id)
        where
        i.Item = 'apple' and
        i2.Item = 'sugar'[/code]

  9. Can you please write the query to get recipie name having ingredients 'apple' and 'flour'?
    it hould return only 'Apple Crumble' not 'Apple Crumble','Fruit Salad'

Leave a Reply

Please use [code] and [/code] around any source code you wish to share.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>