Thursday 27 November 2008

The future of query optimizers

Future performance optimizations


While learning about the Oracle 11g invisible index feature, I really became aware of how complex modern cost-based optimizers have become. Essentially, Oracle has added a feature that allows you to make an index invisible to its cost-based optimizer (CBO).
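
For reference, this is roughly what the feature looks like in Oracle 11g; the table and index names below are made up, but the statements themselves are the documented syntax:

-- Index and table names are illustrative
CREATE INDEX orders_status_idx ON orders (status) INVISIBLE;

-- Toggle visibility on an existing index
ALTER INDEX orders_status_idx INVISIBLE;
ALTER INDEX orders_status_idx VISIBLE;

-- A session can still ask the CBO to consider invisible indexes
ALTER SESSION SET OPTIMIZER_USE_INVISIBLE_INDEXES = TRUE;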

This brought back memories of the times when I had to revisit an already tuned statement because performance had mysteriously been dropping over the last few days, or hours, without an associated increase in data volumes or system load.

In almost all cases, and after some investigation, I realized that the previous statement access plan was no longer optimal. This was always due to one of these: either the size of the result set was drastically different, the cardinality of some of the columns involved had changed, or the data volumes, while not having increased in a meaningful way, had invalidated some of the assumptions of the CBO. In all cases, the fix was always either re-running the statistics collection utilities or rewording the statement, perhaps also adding optimizer hints, so that a different access plan was used.
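
In Oracle terms, those fixes usually boiled down to something like the following; the schema, table and index names are invented, but DBMS_STATS and the INDEX hint are the standard tools:

-- Refresh the statistics the CBO bases its decisions on
-- (schema and table names are illustrative)
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SALES', tabname => 'ORDERS');
END;
/

-- Or, as a last resort, nudge the CBO towards a particular access path
SELECT /*+ INDEX(o orders_status_idx) */ *
FROM orders o
WHERE o.status = 'PENDING';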

And yes, having a feature back then that made indexes invisible to the CBO would have been nice. Very nice, in fact. But have we gone too far?

Too clever and too rigid?


At the time, I just accepted that this was a shortcoming of the CBO; for all its advantages, one could not reasonably expect the CBO to always get it right. That was what I was being paid for, after all. No system is perfect, and the Oracle CBO is especially good at picking the best access plan. Very seldom have I had to use optimizer hints or other tricks to force it to use a better alternative, and for most of the work in an application (the famous 80-20 rule) the CBO does a good enough job.

But recently I was involved in some work for a content recommendation system.

Initially it was only a performance problem, but after becoming familiar with the code and algorithms, I was able to improve the recommendation accuracy as well as pull off one of those spectacular "14 hour process converted into a 0.02 second online query" feats.

Now, I'm no expert at statistics. I have sometimes tried to understand all the complexities of the Netflix Prize submissions, and I would need six months to even grasp the finer points of all the algorithms. But one thing became clear: all of them try to predict future trends based on the past. And the best algorithms even adjust themselves over time to correct their own mistakes. Of course, all of them reach a point where they cannot get better accuracy and their results stabilize. They will never be able to reach 100% accuracy, as the mistakes they make when predicting are also their basis for improving their predictions.

Sorry for the digression; what does this have to do with database performance, after all? Somehow, after reading the Oracle specs, I was under the impression that Oracle had created a fix for a problem that its CBO had created in the first place. By the way, if you check the documentation of other RDBMSs, for example MySQL and Postgres, you'll find that they don't have a similar feature (so if you're doing one of those feature list comparisons, here's one for Oracle). Even the fairly good, sometimes even better than Oracle, Postgres genetic algorithm for selecting an access plan does not allow you to selectively "hide" its indexes. In short, Oracle allows me to protect the CBO from itself. This led me to think: why is this necessary? Shouldn't the problem have never happened to begin with?

Tuning approach today


More or less, the standard approach to database tuning always starts with the 80-20 rule. Simply said, for 80% of the application, performance is perfectly acceptable. In those cases, nobody is willing to invest time in improving performance, because there is no real benefit in doing so. What we should not forget, however, is that the database could in fact be performing very poorly in that 80% in terms of resource usage. But as business priorities are what drives the focus on performance improvement areas, it does not make sense to look at that 80%.
But, and this is important, that does not mean that we're not wasting resources there.

I predict that with the recent rise of application service providers and cloud computing, the utilization ratio of large server farms will become an important issue. Let's say I have a cluster: if it is being used at 50% capacity and I can magically make it 10% more efficient, that's really going to make a 10% improvement in my ability to serve more customers. Note that the 10% applies equally to the 80% that performs satisfactorily from a business point of view and to the usually problematic 20%.

But it is not practical to even try to tackle the 80%, simply because of the sheer size of the job. As applications grow in size and complexity, no team of manageable size can even think about doing such a systematic analysis. Simply looking at thousands of access plans is not viable. We must trust the CBO; it's our only way of keeping up with growing application complexity.

The four traps with current optimizers


Access plan generation is one of the features that differentiate RDBMSs in the market. It is probably very complex to implement, as theoretical principles and real world experience must be merged to produce something that is "almost" always right. And by "almost" we should read "good enough in most cases". Unless you're dealing with really tight constraints, it's really not that relevant whether a SQL statement takes 0.01 or 0.02 seconds to execute, as the difference usually does not have a major impact on overall system performance (again, of course there will be extreme cases where this actually is very important). What is important is that the database maintains a consistent, real world reflection of the complexity and data volumes that it is handling. One finds it reasonable that a statement dealing with small data volumes executes quickly, and that statements handling millions of rows complete in a reasonable amount of time.

And current optimizers do quite a good job at that. But they fall into four traps.
  • The context trap. As I've said many times, they don't know the business context of the data, nor the business priorities. So they cannot say "nah, no problem in not using this index this time. The system is not loaded at the moment, I have power to spare and will finish a bit later than usual, but nobody will care about this" or "hey, it's really critical that we execute this process in two hours, go head over heels if necessary and make it faster". The on-site presence of a performance specialist is usually a symptom that a database has fallen into this trap.

  • The complexity trap. As seen in the Oracle example, sometimes they try so aggressively to optimize that they overlook opportunities for optimizing. The presence of optimizer hints is usually a tell-tale sign of this trap.

  • The change trap. An access plan is expensive to generate, and is usually cached to avoid paying the cost of generating it again. This means that the plan cannot change should the cardinality, data volumes, etc., change. The existence of features to hide, or avoid using, data structures whose only reason for existence is to improve performance is a symptom that the system has fallen into this trap.

  • The 80-20 trap. As I've said, some of the most frequent operations can in fact be under-optimized, but as long as we don't have any feedback from the system owners, we'll never notice or care.

Something is wrong here, and I just realized that optimizers should learn from content recommendation systems. As far as I know, the optimizers currently in use generate an access plan and then store it to be reused later. None of them try to verify that their assumptions were right, or that the resource consumption was in line with their expectations. In short, none of them learn from their mistakes.

Adaptive access plans


This, I predict, will be the next step in cost-based optimizers: the Adaptive Access Plan. In the future, databases will still keep a log of the statements they execute, but they will add to that information about how long each took to execute with a certain parameter set. They will keep track of how many reads were made, and how many of them were served from the memory cache. They will analyze that history and adapt their access plans to the data volumes, system concurrency factors, previous execution times and available resources.
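
Just to make the idea concrete, here is a purely hypothetical sketch of the execution history such an optimizer could keep; no current product works like this, and every name below is invented:

-- Hypothetical log of every execution of every cached plan
CREATE TABLE plan_history (
    stmt_hash      CHAR(32)  NOT NULL,  -- hash of the normalized statement text
    plan_id        INT       NOT NULL,  -- access plan that was chosen
    executed_at    TIMESTAMP NOT NULL,
    param_profile  VARCHAR(200),        -- rough description of the bind values used
    elapsed_ms     INT,                 -- actual execution time
    logical_reads  BIGINT,              -- reads served from the memory cache
    physical_reads BIGINT,              -- reads that had to go to disk
    PRIMARY KEY (stmt_hash, plan_id, executed_at)
);

An adaptive optimizer could then compare what it predicted for a plan against what this history says actually happened, and demote plans that keep missing their estimates.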

In that bright future, we'll see performance improve over time as the system adapts its access plans to changing data patterns. Databases will tune themselves, not merely suggesting different ways of running, but even running different access plans in parallel and choosing the one that performs best for future executions.

Will this be the end of the database tuner? Probably not. The context trap, unless some sophisticated Turing test compliant software is created, will remain the realm of human beings.

PS: I'm sure that there are people playing around with the idea of adaptive access plans, and that some of them are even thinking of getting a patent granted on it. Consider this article as prior art.

Friday 14 November 2008

More on tuning tools from industry luminaries

It has been a long time since my post about tuning tools, but I found a very interesting paper written by some of the best people in the database world. At one point they say:

"... Moreover, at the current time, the automatic tuning aids in the RDBMSs that we are familiar with do not produce systems with anywhere near the performance that a skilled DBA can produce. Until the tuning aids get vastly better in current systems, DBAs will turn the knobs."

(By the way, I'm still in the process of digesting all the very interesting content available there. If you want to have a broader view of where the database world is heading, the VLDB papers are a good place to start.)

While this is missing the "why" part of the question, which I argued is that those tools don't know the context well enough to tune as efficiently as a skilled person does, it is really comforting to know that I'm not the only one with that opinion.

A missing off-topic post

This post was going to be quite long and, hopefully, entertaining. It took me the better part of an hour to write and was, I suppose, interesting to those upgrading or about to upgrade to Ubuntu 8.10.

Unfortunately, this post (well, its first version) was written as a mail message to a @blogspot account. Of course, given the volume of mail that I handle in my day to day job, I usually clean my sent items folder as soon as a message is sent. Which is what I did with the previous version of this post.

Only to find later that either something is broken in Blogger's system for handling incoming mails as draft posts, or I've not yet understood how it works. Which means that this post was lost.

So you'll have to make do with a brief summary of what was previously a quite entertaining story about my adventures upgrading my desktop Ubuntu from 8.04 to 8.10. Because there's no way I'm going to rewrite it; I probably couldn't anyway, as my writing stamina for the week is nearly exhausted.

But in summary, if your hardware is less than three years old, your upgrade to Kubuntu will probably be a good experience, provided that your home network setup is not very complicated or you have previous experience tweaking configuration files by hand. In that case it's not going to be that bad.

But Kubuntu 8.10 is not for revving up mom and pop's five-year-old desktop. Not because the distribution is not optimized or KDE4 is not fast, but because there has been a surprising lack of attention to the essential details that would make it as user friendly as 8.04 with KDE 3.5 is.

And I hope to someday learn how to create draft blogger posts by sending mail...

(Update: I found the problem, and it was entirely my fault. I sent the message to a @blogspot.com address. Had I read the instructions more carefully, I would have sent it to a @blogger.com address, which works as advertised. Mail handling is probably so perverted by the spammers nowadays that the @blogspot.com server does not even send an error response back when a wrong address is used.)

Wednesday 22 October 2008

Optimizing MySQL data loads

Classic Mystery


Sometimes, performance problems can be like a classic mystery novel. Hey, just a bit of fantasy sometimes helps to set the stage. Here we are, surrounded by the usual suspects, and pondering which one is the most likely to blame. And then the underdog detective comes up pointing to the least likely person to have committed the murder. Of course, all the police detectives that have been tracking the evidence scattered around the crime scene, changing their opinion all through the book about who was the most likely person to blame, shake their heads in disagreement and manifest their disbelief.

Of course, the characters themselves, had they read some books of this type before, would know that they were wasting time all along. Everything they've done so far has only created reasons to discard all the usual suspects, and they almost reach the end of the book with no good solution to the problem.

But somehow, and perhaps with a bit of cruelty, they allow the strange person in the room to explain the logic behind it. What they really want is a bit of relief from all the tension and stress that having an unsolvable mystery on your hands creates. After all, they want to see that the rest of the world is as lost and clueless as they are, or even more. When the explanation starts they even refute and joke about each fact stated, and about how absurd the reasoning is.

However, after the first laughs they shut up and listen, as facts begin to stack one on top of another, without any cracks in the logic. At some point the audience seems mesmerized, following each sentence in silence, their minds trying to anticipate the next logical conclusion.

And suddenly, it all makes sense. Everyone is convinced that it must be right. Of course, when the real culprit faces detention, he (or she) either gives up and confesses or runs away.

Database tuning mysteries


Performance tuning, while less dramatic, sometimes has all the ingredients of a classic mystery novel. Well, except for the runaway (of course, there is always that possibility if one cannot reach the customer's performance targets, but it's a bit more professional not to bill them in that case). Also, there are no documented cases of databases escaping from their tuners. But the best tuning jobs are usually the ones where the best solution for the problem is the least expected one. Talk about lateral thinking.

Recently I engaged with a potential customer that wanted to improve the load times of a text file into a MySQL database. This potential customer had a very well defined problem: a text file containing roughly 110,000 records had to be loaded frequently into its database.

The crime scene


Data was loaded into the IMPORT_DATA table. For each record on that table, a comparison was made to check if the record met some conditions. If it did, the record was stored in the DATA_VERSIONS table, after looking it up first to see if there was already an existing record with the same values. In each case, either LATEST_VERSION was updated with the current timestamp to reflect the latest time this record was received, or a new record was added to LATEST_VERSION with the current timestamp and the primary key of the new record.

The net result is that all records ever received from the customer that met the conditions to be interesting were stored in the DATA_VERSIONS table, with a timestamp in the LATEST_VERSION table to reflect the last time they were loaded.

The baseline time for the basic operation was around 10 minutes. The customer wanted the load to be as fast as possible. He was talking about creating table partitions, using the BlackHole engine (more on that later), tuning memory allocation, index structure, whatever.

The data was loaded from a plain text file, one line per record, with fields separated by the pipe (|) character.
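
The load itself was nothing exotic; it would have looked more or less like this (the file name and options are my guess, IMPORT_DATA is the real table name):

LOAD DATA LOCAL INFILE '/data/import.txt'
INTO TABLE IMPORT_DATA
FIELDS TERMINATED BY '|'
-- the first line of the file carries the column headers
IGNORE 1 LINES;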

The essential clues


The way this whole process was done was curious, at least to me. Each time a row was inserted into the IMPORT_DATA table, a trigger was fired. In pseudocode, the trigger read something like:


If RecordIsInteresting Then
    Look up record in DATA_VERSIONS
    If found
        Update LATEST_VERSION for the primary key record with timestamp
    Else
        Insert record into LATEST_VERSION
    End If
End If



The condition at RecordIsInteresting read more or less like FIELD_1 = 'A' AND FIELD_14 = 'B'
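
Putting the pieces together, the trigger would have looked roughly like this in MySQL; the key and column names are invented, and I'm compressing the look-up-then-update-or-insert logic into INSERT ... ON DUPLICATE KEY UPDATE:

DELIMITER //
CREATE TRIGGER import_data_after_insert
AFTER INSERT ON IMPORT_DATA
FOR EACH ROW
BEGIN
    -- column names are illustrative
    IF NEW.FIELD_1 = 'A' AND NEW.FIELD_14 = 'B' THEN
        -- keep every distinct version ever received
        INSERT IGNORE INTO DATA_VERSIONS (record_key, FIELD_1, FIELD_14)
        VALUES (NEW.record_key, NEW.FIELD_1, NEW.FIELD_14);
        -- and remember when this record was last seen
        INSERT INTO LATEST_VERSION (record_key, last_loaded)
        VALUES (NEW.record_key, NOW())
        ON DUPLICATE KEY UPDATE last_loaded = NOW();
    END IF;
END//
DELIMITER ;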

False suspects


Mmmm... by this time you probably already have a good mental model of the process. And if you've faced situations such as this one, I can almost hear your thoughts. Hey, let's check the look-ups. Was DATA_VERSIONS correctly indexed for the look-up of the just loaded record? Was the LATEST_VERSION table also indexed for the existence check? Was the IMPORT_DATA table emptied before the load started?

A good start, then. But it turned out that the answer to all those questions was yes. Everything was correctly indexed, and access plans looked optimal. I even learned that MySQL is pretty efficient with LOAD DATA ... REPLACE, as the timings were not that different from a plain LOAD DATA. Even so, the data volumes loaded were well within the reach of the machine's CPU and RAM, so no resource seemed to be exhausted. Repeated tests with detailed checking of operating system statistics did not reveal any particular bottleneck. Machine resources were used efficiently, and the CPU was not sitting idle waiting for the disks to read the data.

So all the usual first aid measures were not going to help. This is where the fun really begins, as improving performance that way would not have been anything to write home about (much less a complete novel). But then, mystery novels have those characters there just as a vehicle for the reader to discover the essential facts about the crime.

The conventional thinking


Enter the expert and veteran detective who is looking at this case as his golden opportunity to become a media star. As clearly something has to be done to solve the mystery, the veteran takes a long and hard look at the problem. Of course, he examines all the false starts. And after a while, he has an idea. After all, why use a trigger to process the records one by one when all of them can be processed at once? This is a classic optimization technique that I've already covered: what's the point of doing it record by record if you can do all of them in a single pass?

So the expert sits down and writes something like



INSERT INTO DATA_VERSIONS
SELECT
    *
FROM
    IMPORT_DATA
WHERE
    RecordIsInteresting
ON DUPLICATE KEY UPDATE
    TimeStamp = NOW();

INSERT INTO LATEST_VERSION
SELECT
    *
FROM
    IMPORT_DATA
WHERE
    RecordIsInteresting
ON DUPLICATE KEY UPDATE
    TimeStamp = NOW();



With a smile, the expert disables the trigger on the IMPORT_DATA table, loads the data and executes the two statements. Before claiming victory, he checks the timings. Now he's down to 8 minutes! What has happened? He checks that the RecordIsInteresting condition is covered by an index (it is!). Not very proud of his achievement, but satisfied nonetheless as he has won where others have failed, he announces that his improvements deliver 20% better performance.

Lateral thinking


Enter the underdog. Great, he says, at least we know something. Wait, what exactly do you mean? - answers the expert. Of course, we now know that MySQL triggers are pretty efficient. It's only a 20% overhead, which means that the MySQL developers have done quite a good job on their trigger runtime. But have you asked yourselves whether the customer is going to do anything with the IMPORT_DATA information after the load?

- Why? - answers the expert. Of course, he does not want to do anything with the IMPORT_DATA table. That's why he was thinking of using the BlackHole engine for this table, since his only interest was in the data during the trigger execution. Of course, my optimization will not work with the BlackHole engine, as it depends on the data being available to be processed in a single pass.

Realizing what he has just said, the expert creates the IMPORT_DATA table with the BlackHole engine and tests the original data load timing. Well, it now takes 9 minutes to load, meaning that he's still able to claim his 20% optimization. Relieved, he then proceeds to prove to the underdog that the time it takes MySQL to actually store the data is not that relevant after all.

- Oh, yes, I see, so the IMPORT_DATA information is useless after the load. Can I do a quick check? - asks the underdog.

The underdog bangs away at the keyboard for ten minutes or so, then he claims - See, you can load the data in 1 minute. That's it.

The expert is amazed. At first, he does not believe it. He checks and rechecks the resulting LATEST_VERSION and DATA_VERSIONS tables and everything is correct. Well, now, please show us what amazing MySQL technique you've used, he says.

The underdog is, at heart, a shy person. So it takes him a moment to sort out his argument. It's not that simple. I had to check how many interesting records were in the table. As it turns out, only 10,000 out of the 100,000 records loaded met the RecordIsInteresting condition. So I just said to myself, what if I don't load the rest at all?

- But how do you know which record is interesting or not without loading them?, asks the expert
- Well, I wrote this awk program to filter the file before loading it.



# Fields are separated by |
BEGIN { FS = "|" }

# First line holds the headers; keep it so the MySQL LOAD DATA statement
# does not have to change
NR == 1 { print $0; next }

($1 == "A") && ($14 == "B") { print $0 }



- The awk program takes two seconds to execute and creates a file with just the 10,000 interesting records. If you load that file instead of the original one, the results are exactly the same. The only difference is that you're saving MySQL from reading a record from a text file, splitting the line into field values, filling the necessary temporary buffers, firing a trigger and evaluating a condition 90,000 times. You only use MySQL to process the records you already know are interesting.

... end of the story. The real mystery is that my real customer never heard of this solution, as I lost contact with him just after we reached an agreement on pricing. I don't know where he is, and I hope that he does not mind if I use his problem as an excuse to poke fun at classic mystery novels.

PS - I had great fun reading those novels when I was young. And I must admit that I never, ever was able to guess who was guilty in advance.
PPS - Of course, names have been changed to protect the innocent.

Wednesday 15 October 2008

The Google Application Engine

In the beginning, there was timesharing


In the beginning, there was timesharing. This was in the early, early days of computing, before most of us were even born. At the time, computers were big, expensive, complex and extremely difficult to operate. Not to mention underpowered in comparison with, say, a domestic media player. Since most businesses did not have the resources -read money- or even the need to use one of those monsters full time, you basically rented time on the machine. This model of using applications was perceived as a win-win, as the customer did not have to pay the up front costs of maintaining all that infrastructure and the supplier could maximize the return on what was, at the time, a significant investment.

Over the years, as demand for computers increased and prices went down, more and more businesses found that it was profitable in the long run to own and operate the monsters themselves. Thus, the timesharing business model gradually faded, although some parts of it survived. Especially for the big mainframe iron: IBM still continues to "rent" its software and "cap" the CPU cycles that a modern mainframe uses based on a monthly fee.

I'm not sure if this model is still used literally. I'm sure that 10 years ago it was still in use (anyone remember the defunct Comshare company?). But as with most everything in the computer industry, the model has not survived literally; it has morphed over time while keeping the same conceptual underpinnings.

Cloud Computing


"Cloud Computing" is the latest industry buzzword. Basically, it means that when you use applications, those applications are not running in hardware you own, you don't even know what hardware is used. It also means that the provider of those applications has to worry about availability, scalability and other concerns that were responsibility of the company IT department. It also means that the provider can take advantage of economies of scale, by for example reusing transparently its machine resources to support many different customers across different timezones. It can also spread geographically its machines to save on communication costs and improve response times, and apply other optimizations.

So, are we seeing the return of timesharing? Some would say yes, this is essentially the same model applied in the 60s, only using bigger, more powerful and cheaper machines. And yes, this is a repeat of the SaaS (Software as a Service) wave that was the industry buzzword five years ago. Of course there are a lot of similarities, but there are a couple of points that make things different now.

First, companies adopting cloud services are making a choice between their current established infrastructure and the supplier offerings. They have experience, can compare costs and are much more conscious of the trade-offs, limitations and opportunities that each model has. This was not the case with timesharing, since the costs were so high that no one could really afford to supply the same computing resources internally. This time it's a choice, not an obligation.

Second, some software categories have become "commoditized". For some of them (mail clients, office productivity apps, even CRM) one can choose from different offerings and there is a clear price/features relationship. The timesharing of the past was almost always done on either highly specialized number crunching applications (hardly classifiable as commodities) or on completely custom built business software.

The Google Application Engine (GAE)


In the past months we've seen the launch of Amazon EC2, a notable offering that is proving to be a cost effective alternative to housing or hosting providers, and one that also has the resources to scale should the needs of your application increase. As a counter offering, Google has launched its beta Application Engine. The main difference is that Amazon has split its offering into infrastructure categories, such as a storage service (S3) and a "super hosting" service (EC2), whereas Google's offering is centred around an application hosting environment that includes more built-in facilities, but is also more restricted than Amazon's.

Basically, Google offers to share its enormous infrastructure, for an as yet undisclosed price. The advantage is obvious: Google has already proved that it can scale to a massive number of users and store massive amounts of data without disruptions and without degrading the quality of the service provided to the users. They have been developing and refining their own technology stack for years, all of it with an eye on massive scalability. It's very unlikely that anybody outside Yahoo, Amazon or Microsoft can match that, much less with limited resources in comparison with what Google can put together.

Google has not yet provided all the details, and the service is still in beta. But some of the main points of the offering are already apparent. There are already lots of reviews where you can find its strengths and weaknesses, written by people much more knowledgeable than me in web development. But for me, the interesting part is the storage engine. Google provides an abstraction called the "datastore", that is, an API that allows programs hosted on the App Engine servers to store and retrieve data. You don't have any means of accessing the persisted data other than the datastore API, so you'd better familiarize yourself with it before delving further into any application design.

Google datastore, limited?


The most striking feature of the datastore is that it's neither relational, nor does it try to be. It's a sort of object oriented database, where Python (the only language the App Engine supports so far) objects defined in terms of Python classes are stored and retrieved. Optionally, indexes can be defined on properties, but an index only covers a single property. Primary keys are usually automatically generated, and relationships between objects are handled via "Reference" types.

The datastore can then be queried and objects retrieved. The Google Query Language (GQL) has familiar SQL syntax, but beyond the syntax it is a completely different animal. It neither allows joining entities, nor using very common combinations of conditions in its WHERE clause. There are no analytical functions (COUNT, SUM) and there is a limit of 1000 objects per result set.
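
To give an idea of the flavour, a GQL query looks something like this (the "Rating" kind and its properties are invented for the example):

SELECT * FROM Rating WHERE userId = :1 ORDER BY score DESC LIMIT 100

And that's roughly as far as it goes: no join to a second kind, no SUM(score), and no way to fetch more than 1000 results for a single query.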

Is this a rich environment for building data intensive applications? It all depends on your idea of "rich" and "data intensive". Of course, all the missing features can be worked around in some way or another. But, and this is the difficult part, they probably should not be. By that I mean that anyone can write iterative code to perform the joins, aggregations and whatever else you find missing from GQL, but that does not mean you should do it.

For one thing, GAE has strict limits on the CPU cycles that a request can consume, so you'll probably use them up quickly if you write too much code. Those limits are there to protect Google's shared resources (nobody likes their machine cycles being sucked up by the guy next door). So joining entities as if you were joining tables is out of the question. Same for counting, aggregating and other time consuming operations.

Scaling in a scale that you cannot even start to imagine


All those limitations are, at the very least, irritating. Some of them even make some kinds of application completely unfeasible. At first, I felt disappointed by GAE. However, after browsing some more documentation, I soon realized what GAE is all about.

Some background is in order: I live and breathe business applications. By that, I mean that I'm used to the scale of ERPs or Data Warehouses that span, at best (one would sometimes say at worst), thousands of users. That's a formidable challenge in itself, and one that takes a dedicated army of engineers to keep running day to day. The usual mid sized system is aimed at a medium size business or a department inside a big corporation, and never tops 150 users. And I've devoted a significant part of my professional efforts to studying those environments and making them perform better.

Now, Google plays in a different league altogether. They provide consumer applications. While business users can be counted in the thousands, consumers are in the millions range. That takes a completely different mindset when planning and scaling applications. The GAE restrictions are not there just because Google engineers are timid or have developed an incomplete environment. They are there because GAE can make your application scale very big. Awfully big. With that in mind, you can start to explain to yourself why the GAE datastore has so many restrictions. The features they take away are luxuries that you can only afford when you can limit your user base to at most a few thousand users. If you want to scale beyond that, you have to give up on some things.

Consumer applications are different


In short, don't expect to see an ERP built on top of GAE any time soon. It's not the ideal environment for that. But it's equally unlikely that there will ever be consumer-level demand for your own ERP, isn't it? Think of GAE, then, as an ideal platform for consumer level applications. Then it starts to make sense. Think of Amazon EC2 as a super scalable hosting service, where you basically can make your application scale in terms of hardware, but you're going to be on your own when defining how to make that hardware perform.

Even so, GAE and related environments bring a new set of challenges, at least for me, in terms of how to efficiently store and retrieve data. I'll keep posting my findings, and I'm sure it will be a lot of fun to explore and discover.

Monday 15 September 2008

Persistence engines

How things came to be this way


Anyone involved long enough in software development ends up discovering a truth long known by other craftsmen and engineers: each problem is best solved with the tool most appropriate for solving it. The only problem for software is that it is such a young discipline that it is all the time trying to tackle new problems not previously solved. Thus, sometimes it's very difficult to know beforehand whether a tool is going to be adequate or not until you actually try it.

All software development tools, from assemblers to compilers to scripting languages, are a testament to this constant evolution of software to meet the ever increasing requirements of solving new problems using computers. It's easy to see that over time, as the more generic tools are developed and consolidated, more and more tools are created to attack more and more specific problems.

Structured data persistence was one of the most frequently attacked problems since the inception of the first computers, so at some point Codd et al. invented the concept of the relational database. This was so successful that the concept has survived to this day and is used extensively as a standard means of storing and retrieving data.

Reuse, encapsulation, inheritance and polymorphism were another recurring set of needs that software projects, especially at large scale, had to face. Thus, the object oriented programming model was invented and expressed in a number of different languages. Some of the most successful ones (C++, Java, C#) are the standard means of developing software today, usually supported by a large number of pre-built components that provide useful features that can be reused across many different projects.

Sometimes, when facing a new problem, software designers even find it worthwhile to create a new tool just for the purpose of solving that specific problem. That's how small (and sometimes big) scripting languages are born, along with a lot of other tools (parsers, code generators, etc). Another kind of tool in wide use is the "framework", a collection of software components that work together to solve a problem. The distinction between frameworks and applications lies mainly in that frameworks are designed to be used in applications, but they don't usually perform any function directly useful to the business objective of a piece of software.

Frameworks are also a means of abstracting complex subjects into more manageable pieces. Especially with object oriented languages, properly designed frameworks allow programmers to selectively use the parts that they are interested in without having to learn the whole thing at once.

So, in a perfect world, developers have a wide range of options to choose from when creating a new application. In practice this is not the case, and businesses often standardize on a set of tools that are used across projects. Three factors conspire together. First, the complexity of tools means that learning them represents a significant investment in time. Second, the bigger the pool of skilled people available for a tool, the easier and cheaper it is to hire them for your shop. Third, humans have a natural tendency to avoid change, and thus once you find something that has a reasonable price and works for you, you tend to keep it forever.

The end result is that applications are usually built using the same set of components, consistently within and across applications. One of the latest categories of frameworks to have consolidated is the persistence framework (or engine). These are a set of classes that make your objects get stored in or retrieved from a relational database without you (the programmer) even being aware of all the exchange between the application server and the database.

The object-relational impedance mismatch


Wonderful as they are, persistence engines have a problem: they are designed to store objects into relational tables. Objects have methods and data put together, plus (depending on the language) visibility rules. Tables are just tuples of data with relationships defined between them. (Digression: now it's time for me to hide from object/database purists looking to kill me since I've oversimplified concepts tremendously, but for the purpose of this discussion it's enough.) Now it's easy to see that something has to be lost somewhere in the translation; suffice to say that the relational model is a subset of the object model, in the sense that everything that can be expressed with tables and relations can be expressed with objects, but not the other way around, as you cannot express methods or polymorphism with relational structures.

Notice that there are some object databases that do away with the persistence engine and allow you to store and retrieve objects without losing anything on the way back. But, as you've guessed, these are not relational. While these engines may someday enjoy popularity, they are not presently part of the standard tool set in the majority of shops. And that's not likely to change in the near future. So for now we're stuck with relational databases. There have been some attempts, most notably from Oracle, to introduce objects into the relational database. While the technology does have its merits, it has never been very popular and the rest of the major industry players have not followed suit.

The performance compromise: how bad is that?


As always, there is no free lunch. For the persistence engine to work as transparently as possible, there has to be some kind of compromise. You give something, you get something. Since this is a blog about database performance, you are probably expecting me to tell you now how bad it is. Well, surprisingly, not too bad. Provided that you use the persistence engine for what it is intended to be used for.

Here is the list of things that persistence engines are intended to be:

  1. A way to abstract reading and writing objects to your database

  2. A way to map your database tables into objects

  3. A way to perform basic operations (CRUD: Create, Read, Update, Delete) on single instances of objects (read: database tables)

  4. A way to abstract business process rules into your object layer instead of your database layer. This means that rules like "sales above limit x must be approved by a manager, and those above 2x by the Finance Director" are much more easily expressed in object methods than in SQL triggers


And here is the list of things that persistence engines are not intended to be:

  5. A way to avoid the relational database model, the SQL language, or basic principles of database design. Sorry, you'll have to concede some things to your object model, because it is stored in a relational database. Please don't believe claims like the ones made for the EVL design; they are simply not true.

  6. A way to solve every database related problem under the sun. It's always better to choose a simple engine that solves 80% of the problem than a complex one that solves 100%. The logic here is that for the remaining 20% there is probably a better tool around.

  7. A way to forget about performance, because the persistence engine takes care of that. Does it?

  8. A way to perform analytical queries that allows you to avoid writing SQL


Please do not disregard persistence engines as useful tools just because of points 5-8. The strengths of the first four more than make up for them, as most applications have most of their bulk in the areas where the advantages apply. In short, properly used, persistence engines can make your application much easier to write and will not cost you too much in terms of performance. But the engine will not solve all of your problems. Special mention should be made of so-called analytical queries (those that have SUM(), GROUP BY, etc). Those are especially badly handled by persistence engines, as they were not designed to do that. Consider that:


SELECT SUM(sales) as TotalSales FROM ORDERS WHERE CustId = xxxx;


is a hundred times more efficient than:

resultSet = engine.Query("SELECT * FROM ORDERS WHERE CustId = ", xxxx);
totalSales = 0;
foreach (rec in resultSet) {
    totalSales = totalSales + rec.Sales;
}

Chainsaws are not good for juggling


The golden rule for avoiding performance problems with a persistence engine is to apply the same logic that you would apply if you were using a hammer. I'm sure that when facing a screw, even if you've pounded thousands of nails before, you won't even think of using the hammer, will you? The moment you feel that you're forcing a tool to do something it's not suited to do is the time to stop and reconsider your choice of tool for the task.

In short, avoid fighting with the tool. When you start fighting with the tool, it's a symptom that you should stop and think.

Monday 28 July 2008

Please don't swap

While browsing the database forums, it has come to my attention that some people are looking closer at their database system memory management and trying to adjust its configuration settings.

Of course, the more complex the database, the more parameters you can change. RDBMSs like Oracle, for example, have dozens of different parameters that can alter the memory usage of the database to better fit your application needs. This is reasonable, and I've seen spectacular improvements on specific systems just because a single memory setting was changed.

As always, the usual rules of tuning apply here: change only one thing at a time, and change only when you need to. And like any other tuning effort, the process should iteratively make a change and test it.

Different operating systems and databases have different means of checking the results of your changes. The most expensive ones, like Oracle and DB2, even have dedicated third party packages that can help you monitor detailed usage of CPU, disk I/O, cache and memory. The less sophisticated ones may lack those capabilities, but the operating system usually has enough facilities (the /proc filesystem, Windows Performance Monitor) to give you at least a good overview of which resources are being used and how much.

It therefore follows that each change is usually followed by a number of tests that try to reproduce the most stressing scenarios and evaluate the performance of each. Let me say that this is sometimes a nightmare. I'm happy with the scenario "I have this 2-second query and I need to execute it in 0.02 seconds", because that makes for quick and short tests, and those allow a much more rapid change/test cycle. But when you're dealing with processes that take eight hours to execute, it's a pain in the neck to have to wait from one day to the next just to know the effects of a change.

One of the closely monitored parameters is always, or should be, swap file usage. But I've read some comments implying that if your system is not using the swap file, it is not correctly set up. Let me dispel that myth once and for all.

The origins of the swap file


In the beginning there was memory, just one kind of it. The primitive computers of fifty years ago did not have the distinction between RAM and disk memory. All memory was just memory, provided by some rotating drum or other device. Then the transistor industry began its craze and the distinction between fast memory (RAM) and disk memory was born. Essentially, the CPU can use a limited (but the word "limited" means 2GB here) address space of memory that can be accessed in a matter of nanoseconds. Were this the only memory available, most enterprise databases of today would not exist, as they easily exceed that size without a blink. That's why there's a second type of memory, usually called "mass storage" or just the disk, that has much, much greater capacity. However, as always, there's no free lunch, and this memory is much, much slower to access. And much slower means 100 or 1000 times slower.

That's why the operating system always tries to load into RAM the program instructions that have to be executed, as a program would take far longer to run if its instructions had to be fetched from the slow memory instead of the fast one. Other data, such as the application's own data, is usually stored on disk and recalled on demand. Of course, when that happens the program execution slows down enormously, as it has to wait for the disk to retrieve the requested piece of data, but not as much as if the instructions themselves were retrieved from the disk one by one.

This balance of disk memory and RAM has served the computing landscape well over the last thirty years or so. As applications became more complex, they required more and more memory for their instructions to be stored. As applications handled progressively bigger amounts of data, they required the auxiliary, slower memory to be bigger. Technology always provided updates in the form of faster CPUs, faster RAM and faster disks. Fast RAM has remained to this day a comparatively scarce resource, with disk drives reaching terabytes while RAM merely increases in gigabyte steps. Oh, and it is way more expensive than disk space. Thus, RAM was a resource that was carefully administered and assigned by system owners, because a program that could not allocate the amount of RAM it needed on the machine would usually stop its processing dead cold, thus frustrating the run.

However, as computers were applied to more problem areas, it became clear that it was just not possible to anticipate in advance how much RAM you were going to need. Perhaps in some situations, where the number of applications and their usage patterns were precisely known, it was possible to specify how much RAM you would need at each point and thus have enough of it in the machine.

With the advent of the workstation and the multiuser machine this became impossible. Since everybody could use the machine at any time with any mix of applications, it was no longer possible to delicately administer the RAM so that every application could run smoothly and not stop because of lack of memory.

Operating system and CPU designers realized that it was not necessary for a machine to actually have all the memory demanded by the applications being executed. It was enough to appear to have it and provide on demand the regions of memory that were actually in use at any point. This would also relieve the need for manual intervention and administration of machine resources. The machine could just use the fast RAM until it was exhausted, and at that point start to use the slow disk based memory as applications demanded more. Of course, at the price of performance, but the reasoning was that it was better to have a slow running application than no application running at all. Thus, virtual memory was born.

From those basic concepts, the technology has of course evolved to a point where virtual memory has become a mix of art and science, and every operating system in popular use contains a fairly sophisticated implementation of it. Jargon like "committed", "dirty", "shared", "working set" and other words is abundant in the literature, and I confess to knowing only their basic meaning, enough to get the work done. But it's a fantastic facility, because it allows you to basically use your machine as if it had no limits on its physical memory and get reasonable performance out of it. It's a fascinating dance to watch how the machine allocates and releases virtual memory as you open and close applications or even perform certain tasks with them.

Ok, I hear you ask, what is the "swap file" then? Generically, the place on disk where the operating system puts the portions of memory that are being faked as actual RAM is called the "swap file". In some operating systems this is not actually a file but a dedicated disk partition, or even a different physical disk.

Why swapping is bad for your server


Of course, this virtual memory thing is all well and good, but it adds an element of uncertainty to the run time of a process, something that is sometimes undesirable. In fact, real time or close to real time systems usually forbid virtual memory altogether in order to maintain their commitment to execution time. And virtual memory only works up to a point, because when your machine reaches the point of too many processes competing for the scarce RAM space, it spends more time bringing memory pages from the disk than actually doing any useful application work. This is what is called thrashing, and it is bad, very bad for performance.

Workstations and servers - two different worlds


All the above applies very well to your workstation or PC, as you're usually using it without any predefined pattern or restrictions. You start your word processor, now you need some numbers from your spreadsheet, then decide to put it all together in some slides. You launch programs and close them without really being aware of how much RAM you're taking up. And you're willing to put up with a small performance loss as your system starts swapping (which is the word used to denote the point where your machine runs out of actual physical RAM and starts using the disk as a slow replacement), as the convenience surely beats having to organize and manage the execution order of your applications yourself. As long as the illusion works, you're living in a fantastic world of almost unlimited memory.

Life as a server is very different. There are some characteristics that make a machine what we call a "server". For one thing, it does not usually sit on your desk, but lives in a conditioned room with redundant power, protected from excessive heat or cold, and its access is restricted by a locked door for which only a selected few have the key. Its components are usually rated higher in reliability than their desktop workstation counterparts, even if they probably don't have cutting edge performance.

All that is for a good reason, because a server usually performs some specific function on which the work of many people depends. There are file servers, application servers, database servers, authentication servers, proxy servers, directory servers, name servers, and so on. Sometimes the same machine performs more than one role, especially where there are no resource usage conflicts between the services. But all of them are usually performing some essential function, and either some other system or a business process depends on their availability.

A server has to be available and usually has to have a reasonable response time. A server runs a fixed set of processes or services, and usually does not need to deal with unexpected situations like someone sending you a 120MB PowerPoint presentation that you need to print in colour 12 times.

The corollary is: your server's memory usage should remain more or less constant. Except in situations where the workload exceeds the planned capacity, your server should be happily working and its resources (CPU, memory, disk) should be used fully. The essential, unforgettable word here is "fully". Fully means completely, but it does not mean "in excess".

And "in excess" when talking about memory means swapping.

Memory usage is the key


Memory is the fastest resource available to the CPU, so don't waste it. Check the documentation that comes with your database. The most mature ones have configuration parameters that allow you to precisely tune the memory usage of the database, and be aware that some of them (for example, MySQL) have different parameters for different storage engines. So your memory usage with MyISAM tables is radically different than with InnoDB tables (by the way, try not to mix the two in the same instance). As always with database tuning: measure, tweak a single parameter, measure again and continue until you reach a good enough state.
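
For example, in MySQL a quick way to see how much memory the server is configured to use, and whether the InnoDB buffer pool is actually absorbing the reads, is something like this (these are the usual variables for MyISAM and InnoDB; your tuning may involve others):

SHOW VARIABLES LIKE 'key_buffer_size';               -- MyISAM index cache
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';       -- InnoDB data and index cache
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';  -- reads that had to go to disk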

But above all, don't allow the database to swap. You'll find documentation that says that having a swap file or partition of at least double the RAM size is a good thing. That does not mean that you should use that swap space. It's there just in case, not for regular use.

How much memory should I use?


You may hear from other people that if you're not using the swap file, you're not fully using your memory. They are wrong; it's a common misconception that comes from the confusion between physical and virtual memory. You may hear that if your system is not using 100% of physical RAM, you're not doing it right. You may hear that if your database is using more than 50% of physical RAM, you're not doing it right.

Those are the kind of "myths" that I hate. It all depends on the database server, and even the storage engine used in the case of MySQL. There is no "right" answer that applies unversally.

Oh, yes, there is in fact an universal rule. Repeat with me, don't swap.

Thursday 10 July 2008

Procedural programming and SQL don't mix well

Some of the biggest performance problems I've seen came from perfectly reasonable decisions made by fairly intelligent people; especially from good developers coming from procedural programming languages who try to adapt the paradigms they know to the SQL language. I think there's a strong parallel between these cases and some of the most cited problems with people new to object oriented programming: at first, they program with the OO language but they are not really using objects, creating instead one big class comprising 90% of the application functionality that essentially contains a well structured procedural program.

Data sets versus control flow

At its heart, SQL is a set oriented language. Its main constructs, and the philosophy behind them, assume that you're manipulating sets of rows. Years of research have been invested so that you are abstracted as far as possible from how the database actually does that, and so that it is done in the most efficient way. You just write the columns that you want, the relationships between your tables and the conditions that you want to meet, and off it goes. Behind that there are sophisticated tuning engines, monitors, clusters, load balancers and who knows what else, all looking to give you the results in the fastest possible way without you having to know how they are actually obtained. Years of research and tons of money have been thrown at the problem of how to do that in the most efficient and safe way.

In sharp contrast we have the classic procedural languages. Basically, those are a set of constructs (data types, structures, control flow) that allow you to specify exactly the instructions you want the machine to execute. Notice that the procedural programming language is not concerned with why you want to do this or that, much less with the correctness of your steps. It just wants a set of instructions to execute on a set of data. Over the course of time, and thanks to some very bright people, procedural languages have been improving, trying to make the job of the programmer easier. Mostly by abstracting, encapsulating and hiding the parts and details of the problem not relevant to your main concern.

Those are very different concepts. In one corner, a language designed only for data manipulation in the form of sets of rows with columns and relations amongst them. In the other, a set of primitive instructions that, like Lego blocks, allow you to do anything.

Needless to say, procedural languages are very powerful. They allow you, depending on the level at which they operate, to specify all the details of what you want the machine to do. They are the lowest level of abstraction, and with that comes the price of complexity. A domain specific language, by contrast, sacrifices flexibility and general applicability, giving you in exchange much more productivity as well as a view of the world that more closely matches the problem you're trying to solve.

SQL is a domain specific language, its domain being the manipulation of sets of rows with columns that have relationships amongst them. It happens to fit a lot of scenarios, especially business automation problems where machines are replacing and augmenting manual, well specified processes. But SQL sacrifices detail, from control over the concrete operation of the database to the ability to perform seemingly simple tasks. How do you open a window in SQL? How do you print a report in SQL? The answer is, you don't. Well, that's not strictly true; some extensions of SQL allow you to do such things, but they are not really SQL. They are hacks built into the language so that managers and programmers can convince themselves that they don't need to learn another language.

In fact, they are really learning to fit two paradigms in the same language, but marketing is calling it the same as what they already know so that it's easier to sell them the whole thing.

Some examples


The following are real world examples of these types of problems. I've found them everywhere, from SAP to custom built J2EE applications. Bear in mind, these are general rules and you'll sometimes have a valid exception to them. But it's surely not going to be the first, second or third time you encounter them: exceptions to these rules are very, very uncommon.

Loops versus single SQL statements

Ignoring transaction isolation levels for a moment, you will agree that the following pieces of code are equivalent:

FOREACH R IN (SELECT * FROM T WHERE T.B = 2)
    UPDATE T SET T.A = 1 WHERE T.ID = R.ID;  -- assuming ID is the key that identifies the row
END LOOP;

UPDATE T
SET T.A = 1
WHERE T.B = 2;

And yes, they do the same thing. But the seasoned procedural programmer delving into SQL tends to write the first version, because of the feeling of having complete control. But, and this should not come as a surprise, the second version usually performs much better, or at worst on par with the first one, and it is shorter and easier to maintain. This, taken to the extreme, makes people write:

A = MIN_INTEGER;
FOREACH R IN (SELECT VA FROM T)
    IF R.VA > A THEN A = R.VA; END IF;
END LOOP;

instead of

SELECT MAX(VA) INTO A FROM T;

and this time, believe me, for enough rows in T the performance difference is spectacular. Of course, in favor of the SQL statement.

Function calls in WHERE clauses

The following two pieces of code are also equivalent (assuming R.A holds non-negative values):

SELECT * FROM R WHERE SQRT(R.A) = 2;

SELECT * FROM R WHERE R.A = 4;


But as you may have guessed from the heading, the second is usually much faster, because the database can use an index to look up the values of R.A directly. Oh yes, the latest version of your favorite database has function-based indexes, I hear you say. Well, what if you need to look up values based on the results of three different functions? Are you going to create an index for each and every different operation you perform? Function-based indexes are one of those features that can save you from disaster if you're one of the select few that actually needs them, but they are not useful in the general case.
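For completeness, this is roughly what the function-based index route looks like, as a sketch in Oracle syntax (the index name is made up for illustration):

CREATE INDEX r_sqrt_a_ix ON R (SQRT(A));

-- the original predicate can now be resolved through the index:
SELECT * FROM R WHERE SQRT(R.A) = 2;

It works, but every such index is one more structure the database has to maintain on every write, which is exactly the trade off discussed later in the indexing myth.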

Function calls versus in-line calculations

This one has already been mentioned in this blog. It is very nice to have a function like getCustomerCredit(CustID) that calculates the credit rating for a single customer based on its order history. That way, the procedural code that sets up the screen display for order entry can just do lCustCreditLabel.Value = getCustomerCredit(custID). But when the procedural programmer is tasked with creating a printed report of customer ratings, the first version is always:

SELECT custID, custName, getCustomerCredit(custId)
FROM customers....

getCustomerCredit probably looks something like

SELECT customers.creditLimit - SUM(orders.OrderValue)
FROM customers, orders
WHERE customers.custId = :custId AND orders.custId = customers.custId AND orders.stat = 'OUTSTANDING'
GROUP BY customers.creditLimit;

Instead, the following alternative will be far, far more efficient

SELECT customers.custId, customers.custName, customers.creditLimit - SUM(orders.OrderValue)
FROM customers, orders
WHERE orders.custId = customers.custId AND orders.stat = 'OUTSTANDING'
GROUP BY customers.custId, customers.custName, customers.creditLimit;

Replacing built-in SQL functionality

This is probably a repeat of all the previous ones together. When something can be done with raw SQL, it is almost always faster and better to do it in raw SQL. Fancy frameworks may hide some of the complexity, but they will make you pay a price. And that price is usually performance.

You can have some structure in your SQL


As you have seen in the examples above, the more efficient alternatives break abstraction, detail hiding and probably some other principles of structured programming in some way or another. After you've worked long and hard to encapsulate the business rules in a way that avoids repetition and hides complexity, I'm basically telling you to throw that away in favor of better performance. I don't have an answer to this dilemma, and I think that nobody has, possibly because SQL was not designed to abstract business rules or data structures beyond tuples (rows) with attributes (columns).

That's not to say that you cannot have any kind of structure in your SQL. Quite the opposite, you should strive to have it. There's nothing worse or more unmaintainable than an application that hand crafts SQL sentences and sends them to the database for each little thing it does. But always keep in mind that abstractions are useful as long as they pay off. Don't abstract in ways that break the assumptions made at language design time or in the database engine implementation.
For example, it's very common, and good practice, to abstract business rules (data manipulation and validations) into stored procedures.
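As a sketch of what that can look like, assuming an Oracle-style PL/SQL environment and reusing the orders table from the earlier example (the procedure name, the orderId column and the status values are made up for illustration):

CREATE OR REPLACE PROCEDURE set_order_status (
    p_order_id IN NUMBER,
    p_status   IN VARCHAR2
) AS
BEGIN
    -- validation and data manipulation live together, next to the data
    IF p_status NOT IN ('OUTSTANDING', 'DELIVERED', 'CANCELLED') THEN
        RAISE_APPLICATION_ERROR(-20001, 'Invalid order status: ' || p_status);
    END IF;
    UPDATE orders SET stat = p_status WHERE orderId = p_order_id;
END;

The business rule is stated once, in the database's own terms, and every client goes through it.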

I'm not advocating cut and paste programming, or that you should not try to be as structured as possible in the coding and design of your applications. But you really have to be aware of the trade-offs, and above all, don't try to reinvent the wheel. The designers of your RDBMS probably spent a long time thinking about how to make their engine efficient and easy to use. Don't fight them; take the red pill and do things their way. It's better in the long term.

Thursday 3 July 2008

Maintenance programmer and performance

It's almost universal. And unavoidable. Yet I'm not sure if it's right. But it's true. At almost any organization whose goal is to develop and maintain software, there is a knowledge based hierarchy. The best and brightest individuals tend to take care of the more complex programming and architecture issues. The juniors and newcomers are assigned the lowest value-added tasks. Usually, organizations tend to offer career options and other incentives to their best people, in the hope of retaining them and extracting more value from the time they are in the office. The less experienced, productive or talented people are invariably relegated to jobs that are perceived as adding less value.

Agile/Extreme Programming methodologies try to do away with this segregation by using "pair programming" and making everyone accountable for all code in the system, but this is difficult to achieve in practice. It needs a more or less homogeneous team in terms of experience and capability, because without an equal footing "pair programming" ends up morphing into some sort of mentoring scheme, where pairs are allocated so that there's always a "master" on the subject working together with someone less experienced. In an ideal world they would become true pairs given enough time, but time is money for a business that wants results here, now and cheap. Nonetheless, and although I don't have field experience with it, I think that pair programming properly done can be one of the greatest assets of a programming team.

It is, after all, in the business's best interest to maximize the return on every penny it puts at risk (read: invests in a software package), but this hierarchical scheme usually has an impact on the way software maintenance is done, because the best people are usually assigned to the "best problems", and maintenance is often regarded as a necessary evil.

I've never understood that point of view. All the literature about business software development sends the same message, emphasizing readable and clear code, based on field data showing that maintenance is often half of the total cost of any application, if not more. It should follow logically that, given how high maintenance costs are, maintenance should be given the importance it deserves in its own right. But in practice it's often not.

Part of the reason, I suppose, stems from the wide scope of the word "maintenance". From simple changes in user interface or report layout to whole data model changes, all of them often fall into the same "maintenance" bucket. While the profile and capabilities needed for one task versus another are obviously deeply different, the fact is that the same resources are usually assigned to both kinds of tasks.

And that invariably impacts performance. I'm sure that even strict adherence to the most rigid change management methodology will still leave business systems exposed to unexpected performance problems, if only because assessing the performance impact of changes is still a largely unexplored area.

Add a bit here, add a bit there.


Let's tell a story that illustrates this very well. Bob is an entry level developer assigned to minor enhancements in the company ERP system. One day, Bob receives a call from a change manager who has already completed the immense paperwork required to add a simple data point to a data entry screen. When entering a customer order, the system now has to display the outstanding value of all orders pending delivery to the customer.

While Bob is relatively new to programming, he is able to complete the task and creates a simple subroutine (or method, or user exit) that computes the outstanding value of all orders for an individual customer. The change is tested, moved to live, and for Bob it's all well and good.

Months later, an internal control manager who is part of the internal audit team needs to act on a point raised in the last audit about the excessive number of customer orders that exceed the per-customer agreed order volume. He calls for a change so that the data entry clerk can see not only the outstanding amount but also the outstanding volumes.

Bob receives the request and changes the subroutine (or method, or user exit) to display that information on screen. It makes sense to do it at the same time the order value is calculated, since they are closely related. The change is tested, moved to live, and for Bob it's all well and good. So good in fact that he's asked to create a report listing all that information, since the customer service manager will start his daily operations meeting with that information on the desk for the team to arrange and prepare shipments.

Being smart, Bob creates a report that calls his function once for each customer in the customer master, checking for outstanding order volume and adding the customer to the report if there is any. The report runs a bit slow, but that's not a problem because it runs overnight. The change is tested, moved, and everything is good.

Bob moves on to other projects, and eventually leaves the company. He's replaced by Sean, an equally talented and motivated person, who later on receives a request to change the report. As the customer service manager focuses his daily meeting on the activities of the next 24 hours, he needs the outstanding order volume split by delivery date, so as not to waste time reviewing something that is not an immediate concern.

Sean examines the report and, without hesitation, changes Bob's function so that it breaks totals down by delivery date and adds them up at the end. Of course, the function is now even slower and the report more so, but for Sean this is the most logical way to complete the work, since it involves the least amount of change and effort. Also, the report now gives 24 times more information, so it's reasonable that it takes more time. Oh, and besides, the report runs overnight, so performance should not be too big a concern...

Sean's change is tested, moved to live and all is good... well, not quite. Now the whole data entry team is screaming because they cannot enter orders. The system has become inexplicably slow and it now takes ages to do what in the past was acceptable, if not exactly snappy. Of course, the explanation is that the same program logic is being used in two very different contexts (a nightly batch report and live data entry) with very different requirements. Each time the program was changed it made perfect sense in that concrete context, yet the cumulative changes lead to a problem.

I've seen this pattern happen a lot of times in the real world (in fact, this story is true except for the people's names). The problem would have been avoided if the whole system had been closely tested, or if close inspection of the code had revealed that it was used in multiple places, with possibly conflicting requirements.

Of course, had Sean or Bob been mentored, they would have included some comments or documentation somewhere. Had complete performance testing of the whole system been done, the problem would have been detected. But who does that for a seemingly innocent change in a report output or for an information-only label on a data entry screen?

The irony is that, when the problem is detected, poor Sean is blamed for his poor coding practices, while he had few choices in how to proceed with the change. Had he created another procedure, he would have been seen as a "copy & paste" programmer, not taking advantage of functionality already there and investing more time than necessary in the change. Had he said "look, it's very inefficient to loop over customers just to find the ones with outstanding orders, let's do this the other way around as it will be more efficient", he would have been told that performance was not that important. Had he spotted the actual problem of reusing code that was also used in an interactive data entry screen, he would have been told that he was not assigned to that part of the code.

The curse of the maintenance programmer


All of the above conspires against the maintenance developer. Regardless of their experience and skill level, they will always be caught in the middle of discussions: the IS side will argue about costs and resource limitations, and the business side will argue about costs and deadlines.
Plus, an existing application forms part of a closely interwoven web of processes, knowledge and business practices. It's not easy to justify any refactoring once an application is live, if only because the benefits are never apparent in the short term.

I've seen very talented developers despair when assigned maintenance roles. Not because of the quality of the challenge or the problems they have to face, as some systems can be more complex to maintain than to develop, given the number of additions made after go-live. No, the reason for their despair is that they have to deal with a codebase inherited not only from the original developers but also from anybody else that has sat in their chairs before. When they are asked to do anything with the code, and after overcoming the "complete rewrite syndrome", they may feel comfortable with it. However, when faced with change they are always compelled by management to take the shortest (read: cheapest) route. As we know from the travelling salesman problem, that is not always the optimal solution.

Tuesday 20 May 2008

Database tuning myths

Businesses are always looking for predictability. From financial plans that span three years, to operational plans for next year, to five year strategic plans, a good amount of time and resources at any business is devoted just to predicting what the future is going to look like. Entire business functions exist to create the plans and track their evolution. And the stock market regularly punishes companies that do not meet their financial plans. Ironically, it only punishes the ones that fall short, not the ones that grow faster than planned. Someday they should look at some medical disciplines and discover why growing faster than you should is not always good for your health, but that's another story.

It's not surprising that all those processes create a culture of predictability, where everything must be planned in advance and coordinated. Over time, and by repetition, those processes become so embedded in the company culture that people forget why things are done in a particular way. This has many benefits, as the process of planning itself becomes predictable. But it also spills over into other areas of the company, and everything that is either important or big enough ends up having a detailed process that everyone has to follow. Detailed and predictable outcomes reassure the hearts and souls of the financial people because they fulfill one of the basic tenets of capitalism: trust in the future.

Unfortunately, this model does not scale well if you want to keep an eye on all the details, even at the business process level. At some point, you have to forget about the details of how something is manufactured, or how the accounting is done. That gives you abstractions, and those are an essential tool for scaling up the complexity of any model, because without them you would get lost in the details very quickly.

Out of those abstractions, things known as "rules of thumb", or in more corporate/engineering parlance "best practices", emerge over time. Those are supposed to be generic rules or techniques that can be applied to any problem and give satisfactory, predictable results.

But as many are discovering, abstractions are not perfect. By their very nature, they ignore the details and focus on a few areas of interest to keep themselves reasonably simple to use; their usefulness comes precisely from ignoring the details.

If you've kept reading up to now (thank you very much), you may be wondering how all this relates to database performance tuning. Well, this is one of the places where I see rules of thumb applied daily, often with disastrous results. Computer systems have evolved from the centralized mainframe to the distributed storage, processing and client access environments of today. Where in the past you had a machine and storage system with a thin operating system layer on top, now you have to deal with an operating system that attempts to coordinate a wide range of things connected to the system, plus a network stack, a database persistence layer, a business process model layer, a presentation layer and a client (thin or thick) layer.

"Rules of thumb" that were valid in one environment in the past cannot simply be applied to a different, more complex one. Well, in fact, some of them were never applicable in any environment at all. Here's my "best practices hell" top three:

1- We just need to use fancy new hardware


The number one, because there is a natural tendency to associate performance with something physical, and that's the box sitting in the data center. Just as you can get a faster car, why not just get a faster computer?

The answer is: yes, as long as you make sure that you actually need it. Adding a faster CPU to a system that is disk bound is not going to improve its performance. Putting a faster disk into a machine that runs processes that eat up all its RAM will only make it swap faster. Adding more memory to a system that spends most of its time sending data across the network does not improve its performance.

In short, you need to diagnose before healing. You need to know your system, and you need to know your application and how it's using the system. Without that knowledge, you are just guessing, and spending a lot of money doing it. And I've watched this happen more than a few times. It's very frustrating to spend money and resources on an expensive hardware upgrade only to find that the results are unnoticeable.

The reason is that you've reached some hard limit and your whole system cannot be easily improved by throwing more hardware at it. Then it's time to examine your application and move its performance boundaries to other components. This sounds complicated, and it sometimes is. But often it just means checking the application and making sure it uses database resources judiciously. The classic case is the 10.000 item grid that is completely filled up on application start-up. The network round trips necessary for this are huge and the application changes usually minor. The irony is that after this exercise, you'll probably discover that you actually did not need any new hardware at all.
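The fix for that classic case is usually a one-line change in the query rather than a hardware order. A sketch in Oracle syntax (the items table and the page size of 50 rows are made up for illustration):

-- fills the whole grid at start-up, dragging every row across the network:
SELECT * FROM items;

-- fetches only what the user can actually see on the first screen:
SELECT * FROM items WHERE ROWNUM <= 50;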

But the myth stays firmly planted in the industry, because a hardware replacement often means a performance increase in every component. Therefore, wherever the performance boundaries lie, the system performs better after the hardware upgrade.

2- We just need to create a few indexes


This is one of the most popular, yet dangerous myths. Come on, let's settle this discussion by taking the myth to the extreme. Go ahead, index everything in your database. And I mean every single column and every different combination of columns in each table. Now try to do a few queries. Wow, some of them are probably faster. Now try to do any INSERT or UPDATE and watch the extreme performance degradation. Convinced?

The explanation is that an index is just a table with a special organization that allows you to find things in it quickly. Yes, it's a small table because it contains only the indexed columns, but it's a table nonetheless. Each time you add an index you're creating one of those tables and increasing the overhead of keeping all of them up to date, not to mention the concurrency issues that your database has to deal with each time it updates a table and its indexes. Oh yes, I hear you saying, ha!, clustered indexes do not have this problem! True, but the problem with clustered indexes is that you cannot have more than one per table, so it's very unlikely that you're going to cover all cases with them.
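To see the write-side cost, picture the "index everything" dare applied to the T table from the earlier examples (the index names are made up for illustration):

CREATE INDEX t_a_ix  ON T (A);
CREATE INDEX t_b_ix  ON T (B);
CREATE INDEX t_ab_ix ON T (A, B);
CREATE INDEX t_ba_ix ON T (B, A);

-- every one of these now has to be maintained, on top of the table itself:
INSERT INTO T (A, B) VALUES (1, 2);
UPDATE T SET A = 3 WHERE B = 2;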

Remember, performance tuning is a trade off. It's about using available resources where they are most useful, and indexing everything is clearly not a way to get the most out of your system.

This is not to say that indexes are not useful. They are. But anyone applying rules of thumb is sooner or later going to fall into this trap. Remember, you've been warned.

3- We did this the last time and it improved performance


As explained in the introduction, "rules of thumb" do not account for complexity. Face it, the systems you're dealing with are very, very complex. The performance of a system is dictated as much by its external environment and constraints as by its own internal characteristics.

So, unless the previous system was identical to the new one, this assumption is likely to be false. This myth is particularly firmly implanted in single product specialists and consultants, and becomes very visible when they move to something different. Remember, reject any changes that do not come with good supporting evidence. Just saying "because I have a book that says so", "because the vendor recommends it" or "because it worked well in my last project at customer xxxx" are not valid reasons per se.

Wednesday 14 May 2008

Having a baseline

While the technical part of database performance tuning is different in almost every case, on the business side of things I almost always run into the same conversations. My favorite one is the conversation with potential customers. It follows more or less the same structure, and it always begins with the same sentences:

-Customer: our application has performance problems, we're looking for someone to improve it.
-Consultuning: have you already identified specific parts of the application that are not performing as you need?
-Customer: well, it's this screen/process/job that is perceived as slow.
-Consultuning: ok, let's have a look at it. What exactly do you mean by slow?
-Customer: I mean awfully slow, our business cannot operate with these response times.

Ironically, you'd think that at this point the customer's answer would be "it takes xxxx seconds", but usually that's not the case. It's unusual to find a customer that has actually timed it, even to the second. For interactive applications ("dialog steps" in the SAP world) this is understandable, as usability studies proved a long time ago that anything longer than three seconds is perceived as "slow". For batch processes, since they run unattended, the execution time is usually not monitored even if it's captured in the system logs.

But remember, if you don't measure it, you cannot tune it. Any performance improvement effort will be a balance of cost, time to implement, and resource consumption. These have to be weighed against the benefits you're getting, and unless you have some tangible numbers, you simply cannot do that. So you'll have no basis to accept or reject any proposed changes. One of my own rules of thumb is to reject any proposed change that is not backed up by measurements proving that the performance gain is worth the cost. This is also one of the points of friction when dealing with people (or automated tools) trying to improve performance by applying "rules of thumb", as the cost and impact of changes is often ignored. But it's a discussion worth having.
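Getting that first tangible number does not have to be sophisticated. A minimal sketch, assuming an Oracle database reached through SQL*Plus and reusing the orders table from the earlier examples:

SET TIMING ON

-- run the statement (or a representative piece of the slow process) and record the elapsed time
SELECT COUNT(*) FROM orders WHERE stat = 'OUTSTANDING';

Whatever tool you use, the point is the same: write the number down before changing anything, so every proposed change can be judged against it.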

And what about the targets?


Yes, ideally, you should have a performance target. The problem is, the very nature of performance tuning is not deterministic: if you already knew what your target is and that you can reach it, you would probably have already solved the tuning problem. So the conversation continues...

-Consultuning: well, how fast do you want it to be?
-Customer: of course, I want it to be fast enough.

My advice is that you try to define "fast enough", but be aware that you'll not get any commitment to reach that performance level from any tuning expert worth their title, except in the "been there, done that" cases, and those are very rare with bespoke applications. So we have two cases:

- You have a performance target. Stop tuning when you reach it. Save your tuning budget for later rounds.
- You don't have a target. Then keep this in your head instead: each tuning step will yield less performance benefit than the previous one. At some point, the gains will be minimal and the cost to implement them too high. That is the moment to stop tuning.

Wednesday 7 May 2008

Sometimes a tenth of a second is important

As in any other field where a certain skill level is reached, performance tuning becomes, beyond a certain point, some kind of magical black art for the uninitiated. It's then part of your skill set to explain what you're really doing (and what they are paying for) in terms that a non-specialist can understand.

This is the point where excessive simplification can, and in fact does, give you problems. A case in point was a tuning exercise done on a large database import process. The import process was iteratively executing, row by row, an SQL query that retrieved a single row. The application design was not optimal, as this access was not even partially indexed, and the import process was agonizingly slow.

The data model had been "customized" from the original, big application vendor design to adapt it to business needs, and that customization had not taken into account all the possible impacts of changing some metadata. But since the import process was also supplied by the vendor, there was no way to change it. Calling the vendor for support was not an option either.

Ah, I said, no problem, we can create the necessary index and be done with it. But first, remembering that performance tuning is always a balancing exercise, I took the extra caution of checking the cardinality of the fields involved. The query in question had four columns in the WHERE criteria, but only three of them were significant in their selectivity. The fourth one had just a 2:1 ratio, while the other three together had more than a 10.000:1 ratio. Thus, I created an index on the first three fields and left the fourth alone.

It is always a good idea to check the selectivity of your indexes; having unnecessary columns in them simply degrades performance without any benefit in access times. There is also a very common, and wrong, performance optimization "rule" among non-specialists that says "just add more indexes" as an answer to almost any database performance problem. The cardinality check is usually a good way to dispel that myth, because more indexes do not always equate to better performance.
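The cardinality check itself is a couple of statements. Something along these lines, with made-up column and table names standing in for the four columns of that WHERE clause:

SELECT COUNT(*)           AS total_rows,
       COUNT(DISTINCT c1) AS c1_values,
       COUNT(DISTINCT c2) AS c2_values,
       COUNT(DISTINCT c3) AS c3_values,
       COUNT(DISTINCT c4) AS c4_values
FROM   import_table;

-- only the three selective columns make it into the index:
CREATE INDEX import_ix ON import_table (c1, c2, c3);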

After I did that, the results were good: the single row select was down from 10 seconds per row to .2 seconds per row.

Well, that's it, I said. Only later the application developer came in and asked if I could further improve the performance. Since it was already 50 times better, I was reluctant to add the remaining column to the index. But when I did, performance went down from .2 to .1 seconds per row. This kind of improvement, while spectacular in itself (how many times can you double the speed of something for such a small price?), was nowhere near the order of magnitude gained by the first change, and probably not worth doing in the general case.

Unless, of course, the single row SELECT is performed 100.000 times. Then it really makes a difference whether you wait 10.000 seconds or 20.000. In the first case your process will finish within normal working hours. In the second, you'll have to stay in the office late until it finishes. Case closed.

Over-specify to avoid performance problems

Nobody likes to stop the press, the train, the plane, the cash collection, or whatever process their business is engaged in, waiting for a computer to finish its task. Nobody likes having to fit the process ("the list of deliveries will not be available until 11:30 in the morning") to the timing of a machine. Those are usually the symptoms of a computer system not performing up to business requirements.

Of course, this is more commonly seen in the business world. Software manufactured in other fields simply cannot afford to have performance defects. Think of life support or vehicle control systems, where performance is a requirement expressed in very precise terms. Think of video games, where jerky animation means end users will not purchase them. Software used to support business processes, on the other hand, is usually viewed as a constant balance between cost, time to market and quality. Seasoned project managers, when confronted with the previous sentence, will instantly recognize it and say "you cannot have all three, you have to pick two".

In the cost, time and quality triangle, where do you place performance? As veteran project managers and performance consultants already know, the answer probably lies somewhere between the three. Performance is surely a quality item, but it also happens to impact both price and time to market. Here's my experience in the field; note that your mileage may vary.

In the end, it's the quality that suffers


The traditional project life cycle in a business environment is seldom completed as initially envisioned. Very few projects end with their scope or schedule unchanged, if only because the traditional business practice of "optimizing" everything cannot resist the temptation of subtracting resources or shortening the timeline when the prospects of a project finishing on time and budget are good. Very seldom will the business drop features or accept postponing delivery dates, perhaps because the business side of the project understands those concepts very well, whereas "quality" is more abstract and harder to quantify.

Besides, this is software, right? Being software, it means that the same principles applied to other engineering disciplines are not applicable here. This is not the place to discuss whether software is engineering or not, but certainly some rules that apply to well established engineering practices do not apply. Can you imagine a civil engineer changing the width of a bridge once construction work has started? Hardly. But the same civil engineers, when leading a software project, have no problem changing the specifications for data volumes while the system is being built. This makes it even more unlikely that all the spec changes stay in sync, not to mention the performance ones.

To have any chance of surviving schedule and scope changes, performance testing, like unit, functional or integration testing, should be an integral part of the project plan. If you don't have performance requirements and testing at the beginning of the project, skip to another article, as the damage has already been done. You'll have to resort to monitoring the live system's performance and watching for performance trends to show up, before even having a clue of what you need to do to keep acceptable performance levels.

But if you have performance testing as an integral part of the project plan, don't hold out high hopes either. In my experience, it's the "quality" corner of the triangle that suffers. Outside of mission critical contexts with human health at stake, and in spite of all the "quality" buzzwords that surround modern management methodologies, changes in scope or schedule will always win over quality. Partly because "quality" is a hard concept to define for business software, and partly because the number of defects in a software application is largely irrelevant as long as it fulfills business requirements today. And today means this quarter, not even this year. Since software is infinitely flexible, we can always fix defects later, right?

This philosophy largely permeates management because, much like performance tuning, everything is seen as a balance. Your application is perhaps not complete, has a few outstanding bugs and has not been performance tested. But the business benefits, and sometimes management reputations, are much preferable to moving the delivery date in order to have time to fix those problems. Let's not even talk about performance testing, for which, except in cases where a good previous baseline exists, it is almost impossible to estimate data volumes or densities.

But it is manageable, isn't it?


The answer is both yes and no. Yes, because everything that can be measured is manageable (and even things that are not measurable should be manageable; it's only that you cannot easily rate how good you are at it). No, when you reach the point where managing an issue becomes more expensive than any losses you may incur by simply not trying to manage it until you are sure of what the issue actually is. That is, if you have to specify performance targets for each of your application processes, you're probably going to spend much more time than just saying "no process should take more than 3 seconds". After stating that, you'll discover that 3 seconds is actually a very short time to generate a couple of thousand invoices, so you'll adjust your performance requirements for that. In the end your performance requirements will probably consist of a single word: "enough".

Paradoxically, this is the point where you've realized what your performance requirements are. It makes no sense to tune any process as long as it's performing well enough. It only makes sense to invest in tuning the processes where performance is not enough.

Trouble is, over the lifetime of an application, what is "enough" will change. Because the number of your customers can change, your two thousand invoices can become twenty thousand. Therefore, performance tuning becomes part of application maintenance. That does not mean you'll always be tuning your application; it just means that you have to keep an eye on its performance, so it would be more precise to say that monitoring is what actually becomes part of the standard maintenance process. Actual tuning happens when the monitoring detects something going wrong.

But you want to avoid that, right? Because tuning is an expensive process that does not deliver a clear business case, unless it belongs in the survival dimension. The question is, can I really keep performance monitoring out of the normal maintenance activities?

And again, the answer is both yes and no. No, unless you're always right in your business forecasts and the numbers you're playing with don't change over the application's lifetime. Remember, we're talking three to five years minimum application lifetime. Are you that confident?

And yes, because there's an easy way of avoiding it entirely. In essence, when you plan your system there is a stage we'll call "capacity planning", which is what you're supposed to do when you specify your system hardware. The trick is very simple: take the hardware specs recommended by the application vendor/developer/consultant and triple them. No, better, make them four times bigger. In all dimensions: not just 4x disk space, but also 4x memory and 4x CPU speed. But don't set that system up as your dev or test environment. Keep the test environment on the vendor/developer/consultant's original specifications and make them comply with your performance requirements in that environment. Sounds extreme?

Perhaps you are thinking, "why not just make the performance requirements four times better and set up a test system that is equal to the live one? Wouldn't that be more supportable?" If you're in the kind of environment where the quality of application support depends on having exactly the same hardware for test and live, you probably stopped reading this article a while ago, because you belong to the minority that develops operating systems, those pieces of software that isolate applications from the hardware they run on. I'm not advocating that you choose different architectures, such as testing on Intel and deploying on SPARC, just that you get a platform scalable enough to let you set up different environments for test and live. Besides, in the long run, your test and live systems will in fact be different: it only takes one expensive HW upgrade to realize that you don't need the fastest and most expensive disks on earth for your test environment. In the realm of business applications, identical test and live systems just do not happen.

You need some discipline, because invariably there will be performance tests that pass on the live environment and fail on the test one. Avoid being tolerant: over time this will become more frequent and will inevitably lead to performance problems, because it makes the developer/vendor less disciplined. Set your performance goals on test, and keep in your head some rough numbers that will allow you to predict how it will scale on live.

If you're willing to go this route, be prepared. First, the live system is going to be expensive. Very expensive. Second, the pressure to accept performance results as they appear on the live system, not on the test one, will grow. Third, the 4x figure is based on experience, not hard data. After all, you're buying four times the system that supposedly runs your application well. Only a lot of experience and, if they exist, lessons from past projects, will make your case strong enough to justify the increased hardware price.

It's your risk, you manage it


Did I mention that before? Yes, as always you're facing a balancing decision: spend more than you theoretically need, or face the performance problem later. In a business context, there's nothing that cannot be prevented or mitigated by adding money to it. It's just that sometimes it's too much money.

If you choose the traditional route, accept that at some point in time your monitoring will scream "tuning" at you. Set aside some budget for performance improvements; it will be much easier to justify.