First, we are still within the confines of a single table. Instead of working with the theoretical table "bob" we instead have "sal" (as illustrated below):
LastName FirstName Address1 City
------------ ------------ ---------------- --------
Addams Blake 632 Aspen Dr. Manhattan
Anderson Sam 68132 Nichols Way Peoria
Caldwell David 813 Excelsior Cir. Denver
Albrecht Brigitte 97 Sunnybrook St. Aurora
Brandes Werner 772 Aspen Dr. Manhattan
Elliott Susan NULL NULL
Gaspar Sammy 3632 Allen Way Vernon
If you eyeball it, a few things might stand out. If you've been following my lessons, the presence of NULL values should be a warning flag to you. The values in any column are in no particular order. There are some people in the same city (in fact, the same street), and a couple people with similar names (Sam and Sammy). This small sampling of names and locations should help us form some better query-writing techniques.
First, I'll mention how to handle strings (textual data) versus numbers. There are other data types out there, but for the short term these will be the focus of our data type analysis.
Strings, in the computer world, can be described as arrays of characters. If you aren't familiar with a character, suffice it to say that you're staring at a great many right now (letters, numbers, even spaces and punctuation count as "characters"). Numbers, as you have probably guessed, are numeric, and may or may not have decimal places associated with them. What they won't have are commas (assuming an American notation), as they are not necessary for a computer in order to parse numbers.
Now that we know what these types are, we should next look at how to represent them. If I were to enter the following query:
SELECT *
FROM sal
WHERE LastName = Addams
... we're going to have some problems. The way we've written this, we're comparing the field named LastName against the field named Addams. Except, we don't HAVE a field named Addams. So how should we have done this? People familiar with computer representations of strings are probably thinking quotation marks ("). Close, but not quite.
Instead, SQL is a bit of an odd duckling, in that it treats things wrapped in single apostrophes as strings. Thus, 'MacGyver' is a string, where "Angus" isn't. The quotation marks do serve a purpose, but not in this context, so take care to keep that apostrophe handy when working with strings.
Numbers are a bit simpler. You may or may not have the decimal delimiter (the period (.) for my American brethren) present, depending on the value in question. 3, 03, 3.00, and 003.00 are all valid numbers, and represent the same value (3). Numbers are not wrapped in quotes of any sort, as it is assumed that when something starts with a number, it is going to be a numerical value. This is why you should not prepend any column or table name with a number (e.g. a field named 3Times). It is possible to work with such things, but that's outside the scope of this lesson, so for the time being, assume that all column names and table names start with alphabetical characters.
Now that we have that out of the way, it's time to write another query. We want to return all columns ("select star") from the table "sal" where the person lives in the city of Manhattan. Using the principles described in earlier paragraphs, as well as the previous lesson, we would create the query as follows:
SELECT *
FROM sal
WHERE City = 'Manhattan'
If we do this, we should receive two records as a return value:
LastName FirstName Address1 City
------------ ------------ ---------------- --------
Addams Blake 632 Aspen Dr. Manhattan
Brandes Werner 772 Aspen Dr. Manhattan
Note that we treated the city (Manhattan) as a string, since it is comprised of alphabetical characters. Had we been working with the Address1 field, we would still have treated it as a string, despite the number at the beginning. Any field that has the POTENTIAL to contain a string is treated as a string.
In the last lesson, when I introduced the "select star" concept, I made an allusion to "wildcards." The computer concept of a wildcard is just as the name suggests, it could be practically anything. More specifically, in this context, the asterisk (*) character represents "any quantity of anything." In the context of fields (the "select star"), we are stating that we want any column available to us, no matter how many of them are there. A table with 1 column would give us 1 column in the results, and a table of 10 columns would give us all 10.
We can apply a similar concept to strings. The string 'Man*' can be read as "anything starting with 'Man'." Similarly, '*Man' would be read as "anything ending in 'Man'," and '*Man*' would read as "any string with 'Man' somewhere within it." Getting too abusive with wildcards can create unpleasant results, as well as having an adverse effect on performance of queries, but there are many occasions where you may know part of a string but not the whole thing. In cases where we have partial string criteria, we say a string is LIKE another string.
LIKE? What happened to equals (=)? Well, there may be legitimate cases where you want to return string values that have an asterisk in them. Saying
SELECT *
FROM sal
WHERE City = 'Man*'
... would literally look for every record whose City is Man*. We don't have any such value, so we'll end up not getting anything back. If your purpose above was to look for everything starting with Man, then your choice of comparison operator (=) would have been incorrect, as we're being explicit in our comparison against Man*. Instead, using LIKE as an operator would give us the desired result:
SELECT *
FROM sal
WHERE City LIKE 'Man*'
LIKE only works with text data, but it can be extremely powerful if used properly. However, you should have a very clear idea of what data you're working against before leveraging this operator, as you could just as easily get cities like Manchester and Manassas from such a query. Improper use of such a loose comparison is a leading cause of "flaky" results.
One of the ways to keep such a query from getting out of control is to add additional criteria. If we know for a fact that the only city starting with "Man" that has an Aspen Drive is Manhattan, we could bunch that condition in as well:
SELECT *
FROM sal
WHERE City LIKE 'Man*
AND Address1 LIKE '*Aspen Dr*'
Note that I have asterisks on both ends of the address. The idea behind this is that you could have any number of things (house number, cardinal direction, etc.) at the beginning, and the way "drive" is written could appear as "Dr", "Dr." (with a period), or "Drive" -- and you may wish to see any of the above. However, just remember that every wildcard you add to the query will have an effect (trivial in this case, but in major queries it could add significant time) on performance. Having said that, you'll do what you must to get by, and this is a valuable tool for your arsenal.
Okay, I've been making things too easy up until now. Let's step it up with a third criterion:
SELECT *
FROM sal
WHERE City LIKE 'Man*'
AND Address1 LIKE '*Aspen Dr*'
OR FirstName LIKE 'Sam*'
See any problem with this query? Technically, it will run, and will give you values back, but can you tell just by looking at it WHICH values it will give back? The issue here is that we have two different boolean operators (AND and OR), but how do we know what order they're evaluated in? Are we saying to first determine the City AND Address1, and then after that say OR FirstName? Or, are we really saying to compare City first, then after that we take either the Address Or the FirstName?
The issue I'm describing is "operator precedence." Knowing whether AND has a greater priority than OR, and applying the same to other operators, can make the difference between the CORRECT query and a BAD query. While we may be syntactically correct in how we wrote the above, for readability purposes, we need to polish it up a bit. Let's just assume that we ALWAYS want to have a City like 'Man*', but we'll take either the Address1 or the FirstName condition. The use of parentheses, much like what you'd see in algebra, brings things together nicely:
SELECT *
FROM sal
WHERE City LIKE 'Man*'
AND (Address1 LIKE '*Aspen Dr*'
OR FirstName LIKE 'Sam*')
Don't mind my indentation too much, it means something to me but you'll have to do whatever feels right to you. To me, the OR being deeper than the AND tells me that we look at the AND before we do the OR. However, indentation has no effect on how the query is executed. Instead, we're relying on the use of the parentheses to do this for us. If you're still confused, consider the following facts:
- There are three separate criteria. We'll call the City criterion "X", the Address1 criterion "Y", and the FirstName criterion "Z"
- Either Y OR Z can pass, but AT LEAST ONE of the two must pass
- X must pass no matter what
- Written in boolean algebra, (+ for OR, and * for AND) this is X * (Y + Z)
Truth table time:
X Y Z X * (Y + Z)
--- --- --- -------------
No No No No
No No Yes No
No Yes No No
No Yes Yes No
Yes No No No
Yes No Yes Yes
Yes Yes No Yes
Yes Yes Yes Yes
Nested conditions like this (having a condition within a condition) can quickly hurt the readability, so having a good way to keep track of how deep you are and how many parentheses you're working with is important, particularly when dealing with functions. The way I deal with this, if I know beforehand what the number of conditions will be, is I'll actually write the open and close parentheses () first, then fill in the blank. This way, I don't run the risk of forgetting to close them. If readability is a concern, adopt a good indentation or commenting technique to help you keep track of where you are. Also, it might make more sense for you to keep multiple clauses and statements on the same line, whereas I make an effort to divide them up as often as possible, but either way is fine so long as it makes syntactical sense.
An example of deeper nesting could be something like (using the above example as a base):
SELECT *
FROM sal
WHERE City LIKE 'Man*'
AND ((Address1 LIKE '*Aspen Dr*'
OR FirstName LIKE 'Sam*')
AND Address1 NOT LIKE '*son')
If you consider our new criterion (Address1 NOT LIK '*son') AS W, keeping the original substitution variables from above, we get the following boolean algebraic formula: X * ((Y + Z) * W). If you know your boolean algebra, or even just regular algebra (there's a reason boolean algebra uses this notation), it should make sense that you could make the formula cleaner by changing it to W * X * (Y + Z), so an equivalent (and easier to read) statement would be:
SELECT *
FROM sal
WHERE Address1 NOT LIKE '*son'
AND City LIKE 'Man*'
AND (Address1 LIKE '*Aspen Dr*'
OR FirstName LIKE 'Sam*')
While a strong sense of boolean algebra is not required, it can be VERY beneficial to a SQL writer's bag of tricks. Minimization of formulas can become a science, but there's a reason many computer engineers/scientists are required to learn these kinds of things early on in college. I would strongly encourage any up-and-coming SQL expert to devote some time in this area if they are lacking.
Well, in closing, the things you should take away from this lesson are the LIKE vs. = (equal to) operators, the use of parentheses to distinguish between priorities for condition evaluation, and a sharp eye might even see the NOT operator in use.
Some homework I can dump off onto you would be:
- Read up on boolean algebra
- Consider the other "wildcard" character, the underscore (_). Where the asterisk allowed for any potential number of unknown characters (0 to ?), the underscore specifically means "any one and ONLY one character." Saying WHERE bob LIKE 'ab_de' will work for 'abcde', 'abede', 'ab de', but NOT for 'abde' or 'abarde'.