Thursday, April 9, 2009

Basics - Part 2: The WHERE clause

While the last lesson's concepts will be enough to give you the name of a person whose age is less than or equal to a certain value, this cannot possibly be the only valid for using SQL. If this was one of the many thoughts in your head right now, then I think we're ready to move on to more complex queries.

First, we are still within the confines of a single table. Instead of working with the theoretical table "bob" we instead have "sal" (as illustrated below):

LastName        FirstName       Address1            City
------------    ------------    ----------------    --------
Addams          Blake           632 Aspen Dr.       Manhattan
Anderson        Sam             68132 Nichols Way   Peoria
Caldwell        David           813 Excelsior Cir.  Denver
Albrecht        Brigitte        97 Sunnybrook St.   Aurora
Brandes         Werner          772 Aspen Dr.       Manhattan
Elliott         Susan           NULL                NULL
Gaspar          Sammy           3632 Allen Way      Vernon


If you eyeball it, a few things might stand out. If you've been following my lessons, the presence of NULL values should be a warning flag to you. The values in any column are in no particular order. There are some people in the same city (in fact, the same street), and a couple people with similar names (Sam and Sammy). This small sampling of names and locations should help us form some better query-writing techniques.

First, I'll mention how to handle strings (textual data) versus numbers. There are other data types out there, but for the short term these will be the focus of our data type analysis.

Strings, in the computer world, can be described as arrays of characters. If you aren't familiar with a character, suffice it to say that you're staring at a great many right now (letters, numbers, even spaces and punctuation count as "characters"). Numbers, as you have probably guessed, are numeric, and may or may not have decimal places associated with them. What they won't have are commas (assuming an American notation), as they are not necessary for a computer in order to parse numbers.

Now that we know what these types are, we should next look at how to represent them. If I were to enter the following query:

SELECT *
    FROM sal
    WHERE LastName = Addams


... we're going to have some problems. The way we've written this, we're comparing the field named LastName against the field named Addams. Except, we don't HAVE a field named Addams. So how should we have done this? People familiar with computer representations of strings are probably thinking quotation marks ("). Close, but not quite.

Instead, SQL is a bit of an odd duckling, in that it treats things wrapped in single apostrophes as strings. Thus, 'MacGyver' is a string, where "Angus" isn't. The quotation marks do serve a purpose, but not in this context, so take care to keep that apostrophe handy when working with strings.

Numbers are a bit simpler. You may or may not have the decimal delimiter (the period (.) for my American brethren) present, depending on the value in question. 3, 03, 3.00, and 003.00 are all valid numbers, and represent the same value (3). Numbers are not wrapped in quotes of any sort, as it is assumed that when something starts with a number, it is going to be a numerical value. This is why you should not prepend any column or table name with a number (e.g. a field named 3Times). It is possible to work with such things, but that's outside the scope of this lesson, so for the time being, assume that all column names and table names start with alphabetical characters.

Now that we have that out of the way, it's time to write another query. We want to return all columns ("select star") from the table "sal" where the person lives in the city of Manhattan. Using the principles described in earlier paragraphs, as well as the previous lesson, we would create the query as follows:

SELECT *
    FROM sal
    WHERE City = 'Manhattan'


If we do this, we should receive two records as a return value:

LastName        FirstName       Address1            City
------------    ------------    ----------------    --------
Addams          Blake           632 Aspen Dr.       Manhattan
Brandes         Werner          772 Aspen Dr.       Manhattan


Note that we treated the city (Manhattan) as a string, since it is comprised of alphabetical characters. Had we been working with the Address1 field, we would still have treated it as a string, despite the number at the beginning. Any field that has the POTENTIAL to contain a string is treated as a string.

In the last lesson, when I introduced the "select star" concept, I made an allusion to "wildcards." The computer concept of a wildcard is just as the name suggests, it could be practically anything. More specifically, in this context, the asterisk (*) character represents "any quantity of anything." In the context of fields (the "select star"), we are stating that we want any column available to us, no matter how many of them are there. A table with 1 column would give us 1 column in the results, and a table of 10 columns would give us all 10.

We can apply a similar concept to strings. The string 'Man*' can be read as "anything starting with 'Man'." Similarly, '*Man' would be read as "anything ending in 'Man'," and '*Man*' would read as "any string with 'Man' somewhere within it." Getting too abusive with wildcards can create unpleasant results, as well as having an adverse effect on performance of queries, but there are many occasions where you may know part of a string but not the whole thing. In cases where we have partial string criteria, we say a string is LIKE another string.

LIKE? What happened to equals (=)? Well, there may be legitimate cases where you want to return string values that have an asterisk in them. Saying

SELECT *
    FROM sal
    WHERE City = 'Man*'


... would literally look for every record whose City is Man*. We don't have any such value, so we'll end up not getting anything back. If your purpose above was to look for everything starting with Man, then your choice of comparison operator (=) would have been incorrect, as we're being explicit in our comparison against Man*. Instead, using LIKE as an operator would give us the desired result:

SELECT *
    FROM sal
    WHERE City LIKE 'Man*'


LIKE only works with text data, but it can be extremely powerful if used properly. However, you should have a very clear idea of what data you're working against before leveraging this operator, as you could just as easily get cities like Manchester and Manassas from such a query. Improper use of such a loose comparison is a leading cause of "flaky" results.

One of the ways to keep such a query from getting out of control is to add additional criteria. If we know for a fact that the only city starting with "Man" that has an Aspen Drive is Manhattan, we could bunch that condition in as well:

SELECT *
    FROM sal
    WHERE City LIKE 'Man*
        AND Address1 LIKE '*Aspen Dr*'


Note that I have asterisks on both ends of the address. The idea behind this is that you could have any number of things (house number, cardinal direction, etc.) at the beginning, and the way "drive" is written could appear as "Dr", "Dr." (with a period), or "Drive" -- and you may wish to see any of the above. However, just remember that every wildcard you add to the query will have an effect (trivial in this case, but in major queries it could add significant time) on performance. Having said that, you'll do what you must to get by, and this is a valuable tool for your arsenal.

Okay, I've been making things too easy up until now. Let's step it up with a third criterion:

SELECT *
    FROM sal
    WHERE City LIKE 'Man*'
        AND Address1 LIKE '*Aspen Dr*'
        OR FirstName LIKE 'Sam*'


See any problem with this query? Technically, it will run, and will give you values back, but can you tell just by looking at it WHICH values it will give back? The issue here is that we have two different boolean operators (AND and OR), but how do we know what order they're evaluated in? Are we saying to first determine the City AND Address1, and then after that say OR FirstName? Or, are we really saying to compare City first, then after that we take either the Address Or the FirstName?

The issue I'm describing is "operator precedence." Knowing whether AND has a greater priority than OR, and applying the same to other operators, can make the difference between the CORRECT query and a BAD query. While we may be syntactically correct in how we wrote the above, for readability purposes, we need to polish it up a bit. Let's just assume that we ALWAYS want to have a City like 'Man*', but we'll take either the Address1 or the FirstName condition. The use of parentheses, much like what you'd see in algebra, brings things together nicely:

SELECT *
    FROM sal
    WHERE City LIKE 'Man*'
        AND (Address1 LIKE '*Aspen Dr*'
            OR FirstName LIKE 'Sam*')


Don't mind my indentation too much, it means something to me but you'll have to do whatever feels right to you. To me, the OR being deeper than the AND tells me that we look at the AND before we do the OR. However, indentation has no effect on how the query is executed. Instead, we're relying on the use of the parentheses to do this for us. If you're still confused, consider the following facts:

- There are three separate criteria. We'll call the City criterion "X", the Address1 criterion "Y", and the FirstName criterion "Z"
- Either Y OR Z can pass, but AT LEAST ONE of the two must pass
- X must pass no matter what
- Written in boolean algebra, (+ for OR, and * for AND) this is X * (Y + Z)

Truth table time:

 X   Y   Z   X * (Y + Z)
--- --- --- -------------
No  No  No       No
No  No  Yes      No
No  Yes No       No
No  Yes Yes      No
Yes No  No       No
Yes No  Yes      Yes
Yes Yes No       Yes
Yes Yes Yes      Yes


Nested conditions like this (having a condition within a condition) can quickly hurt the readability, so having a good way to keep track of how deep you are and how many parentheses you're working with is important, particularly when dealing with functions. The way I deal with this, if I know beforehand what the number of conditions will be, is I'll actually write the open and close parentheses () first, then fill in the blank. This way, I don't run the risk of forgetting to close them. If readability is a concern, adopt a good indentation or commenting technique to help you keep track of where you are. Also, it might make more sense for you to keep multiple clauses and statements on the same line, whereas I make an effort to divide them up as often as possible, but either way is fine so long as it makes syntactical sense.

An example of deeper nesting could be something like (using the above example as a base):

SELECT *
    FROM sal
    WHERE City LIKE 'Man*'
        AND ((Address1 LIKE '*Aspen Dr*'
                OR FirstName LIKE 'Sam*')
            AND Address1 NOT LIKE '*son')


If you consider our new criterion (Address1 NOT LIK '*son') AS W, keeping the original substitution variables from above, we get the following boolean algebraic formula: X * ((Y + Z) * W). If you know your boolean algebra, or even just regular algebra (there's a reason boolean algebra uses this notation), it should make sense that you could make the formula cleaner by changing it to W * X * (Y + Z), so an equivalent (and easier to read) statement would be:

SELECT *
    FROM sal
    WHERE Address1 NOT LIKE '*son'
        AND City LIKE 'Man*'
        AND (Address1 LIKE '*Aspen Dr*'
            OR FirstName LIKE 'Sam*')


While a strong sense of boolean algebra is not required, it can be VERY beneficial to a SQL writer's bag of tricks. Minimization of formulas can become a science, but there's a reason many computer engineers/scientists are required to learn these kinds of things early on in college. I would strongly encourage any up-and-coming SQL expert to devote some time in this area if they are lacking.

Well, in closing, the things you should take away from this lesson are the LIKE vs. = (equal to) operators, the use of parentheses to distinguish between priorities for condition evaluation, and a sharp eye might even see the NOT operator in use.

Some homework I can dump off onto you would be:

  • Read up on boolean algebra
  • Consider the other "wildcard" character, the underscore (_).  Where the asterisk allowed for any potential number of unknown characters (0 to ?), the underscore specifically means "any one and ONLY one character."  Saying WHERE bob LIKE 'ab_de' will work for 'abcde', 'abede', 'ab de', but NOT for 'abde' or 'abarde'.

Saturday, April 4, 2009

Basics - Part 1: A SELECT statement

SELECT 1

There, that's the simplest query I can think of (you're welcome to replace 1 with NULL, 'BOB', or any other value of choice).  Oracle users may or may not need to change it so that you're selecting FROM DUAL.  If that's your cup of tea, you're welcome to MSN the search phrase "select dual," or if you're not from the Microsoft camp, you can also "google" it.

Seriously, though, that's pretty useless.  What you get from that statement is exactly what you ask it for: the number 1.  I'll next present a vastly more complex example, using the theoretical table named "bob":

SELECT *
    FROM bob

I'll break down what I've just done above in piecemeal.  First, we're still using that SELECT statement.  In a nutshell, if you read "SELECT" as "GIVE IT TO ME" you'll be okay.  Next is the *, which tells SQL that you want it ALL. You don't discriminate, you just want to see every column the contents of your FROM clause have to offer.  As you can probably infer from context, the FROM clause indicates the source of your data.  Assuming your table "bob" had the following columns (data fields):

  • MyName (text)
  • MyAge (number)
  • MyDateOfBirth (date)

... and you had three rows (records) contained within, you might get results resembling the following:

MyName         MyAge       MyDateOfBirth
--------       --------    ----------------
Smith, John    12          1996-04-04 16:45:00.000
Adams, Ted     98          1910-02-12 12:00:00.000
Davies, Gus    65          1942-06-23 02:16:23.000

I'll touch on it at a later point, but the data types (text, number, date) are probably actually stored in the database differently than I present them.  Also, I introduce some formatting in my SELECT statement above, in the form of indentation, capitalization, and splitting it up into multiple lines, but you could just as easily cram the whole thing into a single line (SELECT * FROM bob).  Do whatever makes it more readable to you, but you'll start to see more of my formatting quirks as we get into more complex examples.

Okay, the fine print is out of the way, so let's look at our results.  We got back everything that was in that table, which is cool considering that we only had 3 columns and 3 rows, but what if we had an insurance claims database with thousands of claimants, and millions of claim lines?  I can assure you that even with the best and brightest computers, you're putting a strain on it by doing a SELECT * (I read that aloud as "select star" for the record).  Why in the name of Bobcat Goldthwait would you EVER want to see EVERY SINGLE RECORD like that?

Well, this is where criteria come into play.  Let's take a peek at the standard syntax to be found in almost every SQL statement you'll encounter in real life:

SELECT (such-and-such)
    FROM (here-and-there)
    WHERE (this-and-that)

To put that into English, what I just wrote could be read as "give me such-and-such from here-and-there where this-and-that equals this-and-that."  Applying this to our earlier example, suppose we wanted the names of people from "bob" whose ages are less than or equal to 50.  By applying this criterion to our statement, we get:

SELECT MyName
    FROM bob
    WHERE MyAge <= 50

We can see that there are a couple of significant changes here.  First, we're limiting WHAT we're grabbing by changing the wildcard to a single field (MyName).  After all, we only want to know the guy's name.  If we wanted more, we could add additional fields by separating them with commas (e.g. SELECT MyName, MyAge, MyDateOfBirth).

The other change here is the introduction of the WHERE clause, in which we specify the filter criteria.  If you recall, of our original 3 people, only one had an age of less than or equal to 50, Smith, John, so when we do our query we see the following:

MyName 
--------
Smith, John

The benefits of adding these two changes are many:  reduced bandwidth (assuming the SQL server you query is not the computer you're doing the querying from), less memory consumed, the user is not inundated with too much information, and probably some others I don't care to list.

What if we wanted to introduce additional criteria?  Let's add four more people to our table bob, so we get the following when we do a "select star" on it:

MyName         MyAge       MyDateOfBirth
--------       --------    ----------------
Smith, John    12          1996-04-04 16:45:00.000
Adams, Ted     98          1910-02-12 12:00:00.000

Davies, Gus    65          1942-06-23 02:16:23.000

Willis, Tad    82          NULL

Smith, Jim     16          1918-12-31 06:82:00.000

Rich, Nick     32          1963-07-04 12:00:00.000

Baggins, Bilbo 111         NULL

See anything new?  You should:  the NULL value.  It means different things to different people, but you'll quickly discover that it can do some damage to your query results if you're not prepared for it.  You can equate NULL to "not defined" if you like; in the context above, it could just as easily mean "don't know, don't care."  The important thing to know is that NULL never equals anything else, not even another NULL.  Thus, the statements 3 = NULL, NULL = 'BOB', and NULL = NULL are never true.  You can never say anything equals NULL, but as you'll discover later, you can say something is NULL.

If I do a SELECT * FROM bob WHERE MyDateOfBirth > '1900-01-01', you'll get everything from the revised table EXCEPT the values where the date of birth is NULL.  This is because we tried doing a direct comparison against values that don't exist, and like I said, you just plain cannot do this with anything except the IS operator.  If you changed your WHERE clause to read WHERE MyDateOfBirth > '1900-01-01' OR MyDateOfBirth IS NULL (more on multiple criteria in a moment), you would get the NULL values back, because we're doing a comparison against nullity at the same time we're doing a value equality comparison.

So, back to that second criterion.  Earlier, I said that you could specify multiple columns in your returned data by separating them with commas.  Well, criteria are a little bit different, because you can have AND and OR criteria.  I would encourage anybody confused by this concept to do a search for "boolean logic" before we proceed any further, because without knowledge of that concept you will NOT (another boolean operator) go very far in SQL.

Before closing for the day, I'll follow up with some homework for my readers:  go out and MSN (google) the following phrases:

  • t-sql select
  • boolean logic

Also, I'll present some of the more common criteria comparison operators as well:

  • = (equals)
  • <> (not equals; don't ask me who concocted this thing, but VB users might recognize it)
  • > (greater than)
  • >= (greater than or equal to)
  • < (less than)
  • <= (less than or equal to)
  • NOT (pretty obvious what it means, probably not as obvious how to use it, so I'll explain this in a later post)
  • IS (used with comparsions involving NULL values)

My take on what SQL is and why you should care

Sometime early on in college, I had the fortune of seeing the movie "Office Space" and realized that if I continued going to school to be a programmer, that this was going to be my future (minus the humor).  Being an Iowa resident, I knew that programming meant one thing: a future in database development for an insurance company.

Well, fast forward a few years, and I'm doing database development.  For an insurance company.  However, I'm pleased to say that not only is that not my full-time job (I'm contracted out to said company), but it's actually quite fun.

I'll repeat that last part.  Database programming is fun.  Now, purists might attack me for using the word "programming" in this context, as query writing doesn't necessarily imply that software development is associated with the process.  Additionally, some might even attack my definition of programming, which I would assert means to provide machine-interpretable instructions to invoke an action.  What?

Back to the whole "fun" business.  I call query writing (SQL) fun, because it deals with things the common smarter-than-the-average-bear person would find interesting:  puzzles, numbers, aesthetics, and puzzles.  I repeat puzzles (again, that word) because in any situation requiring more than the most mundane/trivial of queries, you are given a problem and a toolset for solving it.

Every person is good at some kind of puzzle, with enough practice.  Classic examples are Rubik's Cubes (and the less venerable Rubik's Clock), card games, chess, or balancing chemical equations (okay, I don't think that last one qualifies as fun to many people, but to each their own).  SQL queries themselves present excellent opportunities for puzzle fun as well, and with either the knack for and/or practice, you can excel at SQL puzzles too.

The rewards for solving a SQL puzzle can be great.  When dealing with large amounts of data, requiring analysis and calculation, SQL engines (well, in my experience, this can be said of MS SQL Server) can crank out results efficiently and reliably.  I speak from experience in saying that you can go from days' worth of calculation (said example was written in VB6, using a cursor-based loop of if-elseif-then statements) to mere minutes in SQL.  The SQL written to achieve this spans probably something close to 100 lines of [formatted] SQL, occupying two large statements, and with what I've learned about SQL since that I could probably rewrite it to work in one statement, taking up roughly 2/3 as much space and 80-90% of the execution time.  How is this possible?  The short answer is that SQL engines are DESIGNED to do this kind of work, whereas Java/C[#/++]/COBOL/etc. can do it but with a blow to efficiency.  A well-written statement (with an emphasis on optimization techniques) can fully utilize a multicore machine's CPU potential (I regularly see 100% CPU utilization on a dual-quad-core Xeon setup).

Before I close, a few things about my SQL blag.  First, most people pronounce SQL as "sequel."  I hate this, largely because it is completely meaningless to me.  I mean, where is this thing's "prequel?"  Instead, I like to call it "squirrelly" because of the sometimes convoluted, often eccentric statements that arise from my problem-solving efforts.  I have a strong emphasis on my own particular brand of formatting, but you'll come up with your own, as you try to make that big "mess" you've written maintainable and readable.  Also, I'm strongly T-SQL oriented, being of the Microsoft/Sybase school of thought, but the techniques I write should hopefully help a person master other SQL variants as well (e.g. Access, MySQL, ).  I'm liable to step on a few toes as I move forward, but if you can overlook my oft preferential treatment towards one of the most heavily scorned companies (.NET and MSSQL are my bread and butter, after all), you'll hopefully master SQL in your own unique way, even as I continue to grow in my repertoire of tricks.