Friday, 18 April 2008

Tuning tools: beware

Older readers may remember the MS-DOS extended and expanded memory standards. Back in the pre-Windows 95 era, Intel machines were limited in the memory address space they could use, not only by the hardware, which before the 386 processor could address at most 16MB (yes, MB, not GB!) of RAM, but also by MS-DOS compatibility constraints. There's a famous quote, misattributed to Bill Gates, saying that nobody would ever need more than 640K of memory. This was, obviously, false, and the industry was busy developing workarounds for the compatibility limits imposed by MS-DOS's initial design. A number of software makers, from Quarterdeck to Microsoft itself, released memory tuning utilities that could, in theory, maximize the usage of that precious 640K low memory region.

And they mostly worked. You did not even need to know what low and high memory were: just running them gave you a system whose low memory usage was close to optimal. And you could try to boast about your free low memory to your friends, only to find that they got exactly the same results as you. Trying them was very easy. You just ran the optimizer, accepted its suggestions, and rebooted. After that, your system either hung or worked. In the first case, you booted from a diskette and undid the changes. In the second, you checked the available low memory. If you had a better value than before, bingo.

Databases are no exception to the configuration game. Correct settings can give you a tenfold performance improvement over inadequate ones, and the cost of the change itself is close to zero: just adjust some registry setting or configuration file, restart, and it's done. Knowing the right value, however, is not that easy.
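To make that concrete, here is what such a change might look like in PostgreSQL's postgresql.conf. The parameter names are real PostgreSQL settings; the values are purely illustrative, not recommendations:

```ini
# postgresql.conf -- illustrative values only, not tuning advice
shared_buffers = 2GB          # memory for the shared page cache
work_mem = 64MB               # per-operation memory for sorts and hashes
effective_cache_size = 6GB    # planner's estimate of OS + DB cache
```

A restart later, the database behaves completely differently. Which values are right for a given machine and workload is, of course, the hard part.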

As databases continue to acquire more sophisticated features, the complexity of managing them correctly increases accordingly. What was once a mere job of checking a few memory parameters has become, for the most part, a very delicate set of adjustments balancing application behaviour against machine resources and configuration. Database vendors recognized long ago that for their products to become a commodity they needed to somehow allow the average user to enjoy the product's features without requiring a four-month DBA course. Their solution was twofold.

Sensible defaults

First, they derived from experience a set of sensible default values, depending on the intended usage (such as "OLAP" versus "OLTP"), that allowed the product to work at least reasonably well out of the box. In my experience that is the part that has been most beneficial for performance. Sensible formulas such as "allocate 80% of RAM to the disk cache" mostly work, but they are not designed to cater for any special system or application needs or behaviour. A couple of my favourite examples are:

- A complex query that spent more time in the parse and analysis stage than on the data retrieval itself. The database did not have enough memory to keep the parsed statement and execution plan cached once a few more queries had executed. The symptoms were easy to spot: the customer's query ran very quickly when reissued shortly after the first execution, but as time passed, execution time crept back up to that of the first run. The cached parse results and execution plan were simply being evicted and recalculated too often.
- A single machine was hosting the database server and the application server. The database buffers were simply eating up available physical memory and the application server was also claiming its share of the RAM. The symptoms? The database system performed very well before starting the application server.
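The second example can be sketched on the back of an envelope. The figures below are entirely made up for illustration, but they show why a fixed formula like "give 80% of RAM to the database cache" breaks down the moment another big process shares the machine:

```python
# Back-of-the-envelope memory budget for a machine shared between
# a database server and an application server. All figures are
# illustrative, not measurements from any real system.

ram_gb = 16                    # total physical RAM
db_cache_gb = 0.8 * ram_gb     # the "80% of RAM" default formula
app_server_gb = 6              # app server's working set, same box
os_overhead_gb = 1             # kernel, daemons, etc.

demand_gb = db_cache_gb + app_server_gb + os_overhead_gb
overcommit_gb = demand_gb - ram_gb

print(f"database cache: {db_cache_gb:.1f} GB")
print(f"total demand:   {demand_gb:.1f} GB on a {ram_gb} GB machine")
print(f"overcommitted:  {overcommit_gb:.1f} GB -> the box starts paging")
```

The default was fine while the database had the machine to itself; once the application server started claiming its share, the same setting pushed the system into paging.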

It's very unlikely that default values will ever cover all the possible usage scenarios. Even if they could, they cannot control the resource allocation of other components on the machine. Still, if your environment matches what the developers had in mind when defining the "defaults", those settings are not going to be too bad, and they usually provide a good starting point, or baseline, against which to measure further changes.

Automated tools

Second, they started to release the database version of the MS-DOS memory optimizer: software that looks at database response times and recommends changes based on them. There are many variations, from simple query analyzers to Big Brother-style services that collect the logs and watch activity in real time. These automated tools usually come as part of bigger packages and are intended to help the DBA keep a database system healthy and responsive.

These tools are very good at taking a lot of information and presenting nicely reported results that give an overview of system performance. The best ones even suggest improvements based on the information they collect. And therein lies their biggest danger. Much like the MS-DOS memory optimizers, following their suggestions does not always improve performance. Unlike the memory optimizers, those suggestions, when implemented, can mean serious business disruption, and that cannot easily be fixed by booting from a diskette and undoing the changes.

My favourite example is the query analyzer. One of the most heated discussions I've had with a customer was around index creation for medium-sized tables. They had a long-running SQL query over those tables, and the tool's advice was to create some indexes. However, because it did not take index selectivity into account, half of them were essentially not worth having.
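The selectivity check the tool skipped is quick to do yourself. A minimal sketch (Python plus an in-memory SQLite table with made-up data; the table and column names are hypothetical) of estimating selectivity as distinct values over total rows:

```python
# Rough index-selectivity check: distinct values / total rows.
# A column with very low selectivity rarely justifies an index,
# because an index lookup still touches a large slice of the table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, customer_id INTEGER)"
)
# Made-up data: 100,000 orders, 2 statuses, 5,000 distinct customers.
rows = [(i, "OPEN" if i % 2 else "CLOSED", i % 5000) for i in range(100_000)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

def selectivity(column: str) -> float:
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM orders"
    ).fetchone()
    return distinct / total

print(f"status:      {selectivity('status'):.5f}")       # 2 values over 100k rows
print(f"customer_id: {selectivity('customer_id'):.5f}")  # 5,000 values over 100k rows
```

An index on the status column filters almost nothing on reads, yet still costs space and write time on every insert and update; the customer_id index is a far better candidate. Real optimizers weigh more than this ratio, but it is a sanity check worth running before accepting a tool's index suggestions.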

This is just one example, but you must keep the global picture in mind. Taking a single query and inspecting its access plan can, in the best case, improve the performance of that single query. That does not mean overall system performance improves; in fact, it can mean overall system performance gets worse. In some cases I've watched something even worse: when such a tool is sanctioned and supported by the application vendor, the support staff become completely blind to anything not reported by the tool. The tool's recommendations are implemented without a second thought, even if they don't improve performance at all or mean large downtime windows. The true performance problems hide behind the reasoning "this is what the vendor recommends, therefore by following their recommendations we have the best possible performance." Sorry, I cannot tell you the application vendor or the tool, but if you're running a big ERP package with thousands of database tables you know who it is.

Like the "sensible defaults" option, these tools are not bad per se, but they are even more dangerous. If you're going to take their advice, be sure to have a good test environment where you can replicate your live environment's conditions as closely as possible. Test, test, and test again before implementing the changes.

It's all in the context

While RDBMSs have been converging towards what is called "commoditization", a term describing a market segment where different brands are interchangeable with one another because they have easily comparable features, the same cannot be said of the applications that use the RDBMS as their storage layer. That makes the context wildly different, and thus those tools alone cannot be used to improve performance.

Why do I say that? It would seem to follow logically that, by taking their advice, your system will perform better. To refute this apparently perfect reasoning, just look at the context. If there's a common flaw in both tool categories, it is context. In the sensible-defaults case, the context is the rest of the processes running on the machine. The automated tools, being more ambitious, have even more context to take into account, from other processes on the machine to usage patterns, different modes of operation, changing data densities, and business seasonality and priorities.

In the end, like all tools, they need to be in capable hands to be effective. So next time you hear someone propose a change in your database, ask why. If the answer is "because this tool recommended it", reject the change until the proposal has a solid rationale behind it.

Monday, 14 April 2008

The business case for performance tuning

One of the most difficult parts of an engagement with a potential customer is the moment you discuss your rates, especially in comparison with the rates usually paid for other kinds of professional services involving technology. Let's explore some of the reasons and the economics behind database performance tuning efforts.

The rates

With the recent wave of out/off/shoring, rates for services in the IT sector have been, as the financial types like to say, "rebased", which basically means that the IT market has evolved from one with high demand and scarce supply to one where the balance leans towards the buyers. This has a lot of implications, most of them already known to the wise old IT crowd. But the focus on short-term financial benefits can sometimes make the financially minded deaf to the IT doomsayers.

Of course, you can find many examples where cost reduction in IT labor rates has not meant an associated reduction in quality, or increased maintenance costs, or increased opportunity costs. But that is probably because the media focuses on the failures of these business models instead of the successes. The cases where out/off/shoring has improved the financial bottom line AND at the same time increased the quality of deliverables AND reduced ongoing maintenance costs are quite rare. To be honest, I've yet to come across one.

Sorry, excuse the digression; back to the main topic. Well, not quite, because the topic of cost is very relevant when confronting someone who has already paid quite a bit of money to get a system up and running and is now being told that they have to pay even more to make the system survive organic business growth. Note that the measure of "quite a bit of money" is entirely subjective: even when employing the cheapest resources available, the system will, from the point of view of the payer, have cost a lot of money.

Now, try to open that without a locksmith

Tuning exercises are not cheap. One reason is that the job requires experience. A lot of it. Databases have lots of features, and SQL is a rich and expressive language. Each problem has multiple solutions, and it takes a lot of experience, as well as a lot of false starts, to recognize the best possible solutions for a performance problem.

Once you have the experience, you need to drill into the data. Performance problems do not have a "best" solution per se, something I hope to develop in future posts. The same solution applied to different data patterns has completely different results. There are no magic recipes that can make a system go faster. Well, there are, but they can be like steroids for athletes: abusing them is dangerous in the long term, but they give spectacular benefits in the short term. Mmmm... well, abusing indexes is not going to make your life shorter, and you'll have no problem with the sport authorities either. What a crap analogy, but the topic of magic tuning recipes will certainly be worthy of future posts.

And finally, beyond all the experience and data knowledge, there has to be a lot of business engagement. Only by understanding the processes and the background reasons for a system's design can you suggest good ways to evolve it performance-wise. Business managers are not comfortable with tech talk; they prefer communication in terms they can understand instead of being mired in jargon.

The above skills are not easy to find, and they are likely to sit at a professional level well above the cheap offshoring rates, both because of the experience needed and because of the level of communication required. So yes, the rates are difficult to accept. But the last time you lost your front door keys while outside your house, you had to call a locksmith. The guy came and, with a very small tool set and in a very short time, opened your supposedly burglar-proof door. And it was expensive, very expensive.

At least that's what you remember. You surely don't remember that, without the locksmith's experience, skills, tools and time, you would have had to pay for a new front door instead. Which would have been way more expensive.

Can I avoid all this by working smarter?

Hardly. First, it's very difficult, if not downright impossible, to anticipate the growth rate a business is going to have. Second, it's almost as difficult, if not more so, to predict a system's behaviour under load unless extensive background experience with similar systems is available, or you're ready to set up a test lab, or you're ready to over-specify the system hardware.

After all those reflections and a few back-of-the-envelope calculations, you'll realize that it is just not worth trying to go beyond a reasonable expectation that the system will still work a year from now. Anything you plan today for the next year will probably change within three months. If your business changes at that pace, you positively, absolutely cannot set up any performance projections based on current business conditions.

Note that "change" in this context does not necessarily mean business change; it covers priorities as well. Even organizations that display the least amount of change to the outside observer can become literally paralysed by the rate at which priorities change internally. In fact, in some of them priorities change so fast that there is literally no time for the previous change to settle down or be completely finished. (Side note: those kinds of businesses usually have applications that, in addition to performance problems, show some interesting common design patterns.)

Even if your business environment is stable, you'll still have to make your back-of-the-envelope calculations, because anything more than a cursory examination will yield results that almost certainly rest on assumptions about data density, distribution and frequency.

Oh yes, the sales manager says, each customer is invoiced monthly, just once a month. And the financial planning team has spent six months detailing their processes, perfectly fitted to the sales ones. After that big effort, they're not likely to change for years. Then comes a new sales director, and customers are invoiced each time they get a delivery. The financials for that are very different, so the planners change their process. And logistics needs to update stocks daily. No, wait a minute, not daily but 48 times a day. All your careful performance assessment goes out the window in a minute.

Where's the business case then?

With all that said, the first question in line would probably not pertain to database performance but to application development methodologies. There are ways to prevent and mitigate the problem in the first place. But that discussion will always end in the same place: where to draw the line between business benefits and IT costs. And by applying sound business principles, the line will land much closer to the business priority side than you'd like. Face it, there's very little you can do about it.

But then comes the problem of justifying spending money on just making the application meet its performance requirements. If you're savvy enough, you already know that business arguments can happen on a golf course, over a couple of drinks, or on the tennis court, and all those are valid ways for your arguments to win. But it's always best to have some rational arguments prepared, just in case your demands for additional money are rejected.

In theory, it should be easy. Just put a value tag on each of the activities that your application supports, add them all up, and multiply by the number of extra times they could be done with better application performance, right?

In my experience, that is only the third best way of looking at the problem, because the true business case for performance tuning is usually found on one of three fronts:

- Survival cost - the easiest one. You need better performance just to keep doing business.
- Opportunity cost - a bit more difficult. If you had better performance, what other processes would you be able to support with the spare capacity?
- Productivity - the most concrete one. Your plane has its engines running for a faster take-off as soon as loading is finished, but until the pick list is ready the plane cannot leave. How much fuel are you burning each minute you wait for the pick list?
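The productivity front lends itself to a back-of-the-envelope calculation. A sketch of the plane example, with entirely made-up numbers (idling cost, runtimes and flight counts are all assumptions, not data from any real operation):

```python
# Back-of-the-envelope productivity cost of a slow batch job.
# Every figure below is assumed for illustration.

fuel_cost_per_minute = 120.0   # cost of a plane idling at the gate, $/minute
pick_list_minutes = 25         # how long the pick-list batch run takes today
tuned_minutes = 5              # plausible runtime after tuning
flights_per_day = 10

daily_waste = (pick_list_minutes - tuned_minutes) * fuel_cost_per_minute * flights_per_day
print(f"idling cost per day: ${daily_waste:,.0f}")
print(f"per year:            ${daily_waste * 365:,.0f}")
```

Twenty minutes of idling, ten times a day, adds up fast; numbers like these make a tuning engagement's fee look rather different to the person paying it.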

Check your case; I'm sure it falls into one of the three. And happy golfing.

Tuesday, 8 April 2008

You probably have a performance problem. You'll probably never know that.

Think about it. When was the last time you heard about your application being slow? Every day? Every week? Every month? When was the last time you heard from some business manager "we could do X more/better if the system were more responsive"? Lots of times. How many processes in your business have their schedules and deadlines dictated by machine execution times? I'm sure you'll find a few, some of them probably rated as business critical.

So you approach your dev team (either in-house or outsourced) and ask them for some time to fine-tune the application's response or batch times. You always get some variation (or a mix) of the following two answers.

Answer 1: your application does not have any kind of performance problem, even if somebody says it is "slow". None at all. It is already 100% optimized. All the execution paths have been checked and tested, and each one of them uses the best possible algorithm. There is nothing left in the code worth looking at for performance improvements, nothing. If you want, or more likely need, better performance, just buy us some fancy new hardware.

Answer 2: we just cannot stop to look at performance. There is no time to do that here. If we stop and look at performance, we'll delay the next release, the next patch, the next whatever is in the works. We can always take care of performance later. Do you really prefer to delay the next project milestone/business roll-out/bug fix in exchange for some nebulous performance improvement target? If you want better performance, just buy some fancy new hardware.

Frankly, both answers make perfect sense, and it would be very difficult to challenge them in a business context. You would need extensive knowledge of the inner workings of your application to challenge the first, and a lot of muscle within your business organization to stand up and say "the next release/change process/bug fix is not as important as investing some time in application performance".

This is business, not F1 racing; just being the fastest does not win you any prizes. But performance is always at the back of your mind. You don't quite know why. Well, you do. But like the two answers outlined above, you are really lying to yourself, because you know that, at some point, you'll be under the spotlight for your application's performance. It's just better to postpone it because... (insert some mix of Answer 1 and 2 here)

And then the day comes when your application's performance is no longer tolerable, even by the most generous and forgiving users on the face of the earth. Even by the management board that has never, ever seen the application past its splash screen (well, they probably did in that demo, but they did not understand anything beyond it). And you'll be under the spotlight. And you'll go to your dev team, ask them to improve performance, and you'll be told that... (insert some mix of Answer 1 and 2 here)

So you prepare an extra, unplanned budget, assemble a team of techies and buy some fancy new hardware. One migration project later, your application is running on that fancy new hardware. Surprisingly, performance improves. But only a bit. Not even remotely in proportion to the money you've spent. What happened?

Simply put, answers 1 and 2, or some mix thereof, are not true. Well, they are true until they have to be proved, and that moment comes when you actually need better performance. That is when you realize that your application has a performance problem.

Granted, some applications are not like that. I'm sure that anywhere human lives are at stake, from medical equipment to airplane controls, performance is taken very, very seriously. Outside of those fields, however, performance is just another factor in the balance of costs and benefits.