Monday 28 July 2008

Please don't swap

While browsing the database forums, I've noticed that some people are looking more closely at their database system's memory management and trying to adjust its configuration settings.

Of course, the more complex the database, the more parameters you can change. RDBMSs like Oracle, for example, have dozens of different parameters that can alter the memory usage of the database to better fit your application's needs. This is reasonable, and I've seen spectacular improvements on specific systems just because a single memory setting was changed.

As always, the usual rules of tuning apply here: change only one thing at a time, and change it only when you need to. And like any other tuning effort, the process should be iterative: make a change, then test it.

Different operating systems and databases have different means of checking the results of your changes. The most expensive ones, like Oracle and DB2, even have dedicated third-party packages that can help you monitor detailed usage of resources: CPU, disk I/O, cache and memory. The less sophisticated ones may lack those capabilities, but the operating system usually already has enough facilities (the /proc filesystem, Windows Performance Monitor) to give you at least a good overview of which resources are being used and how much.

It therefore follows that each change is usually followed by a number of tests that try to reproduce the most demanding scenarios and evaluate the performance of each. Let me say that this is sometimes a nightmare. I'm happy with the scenario "I have this 2 second query and I need to execute it in 0.02 secs", because that makes for quick and short tests, and those allow a much more rapid change/test cycle. But sometimes, when you're dealing with processes that take eight hours to execute, it's a pain in the neck to have to wait from one day to the next just to know the effects of a change.

One of the closely monitored parameters is always, or should be, swap file usage. But I've read some comments implying that if your system was not using the swap file, it was not correctly set up. Let me dispel that myth once and for all.

The origins of the swap file


In the beginning there was memory, just one kind of it. The primitive computers of fifty years ago did not distinguish between RAM and disk memory. All memory was just memory, and some rotating drum or other device provided it. Then the transistor industry took off and the distinction between fast memory (RAM) and disk memory was born. Essentially, the CPU can address a limited amount of memory (though "limited" here means something like 2GB) that it can access in a matter of nanoseconds. Were this the only memory available, most enterprise databases of today would not exist, as they easily exceed that size without blinking. That's why there's a second type of memory, usually called "mass storage" or just the disk, that has much, much greater capacity. However, as always, there's no free lunch, and this memory is much, much slower to access. And much slower means 100 or 1,000 times slower.

That's why the operating system always tries to load into RAM the program instructions that have to be executed: a program would take far longer to run if its instructions had to be fetched from the slow memory instead of the fast one. Other data, such as the application's own data, is usually stored on disk and recalled on demand. Of course, when that happens program execution slows down enormously, as it has to wait for the disk to retrieve the requested piece of data, but not as much as if the instructions themselves were retrieved from disk one by one.

This balance of disk memory and RAM has served the computing landscape well over the last thirty years or so. As applications became more complex, they required more and more memory for their instructions to be stored. As applications handled progressively bigger amounts of data, they required the auxiliary, slower memory to be bigger. Technology kept providing updates in the form of faster CPUs, faster RAM and faster disks. RAM has remained to this day the comparatively scarce resource, with disk drives reaching terabytes while RAM merely increases in gigabyte steps. Oh, and it's way more expensive than disk space. Thus, RAM was a resource that was carefully administered and assigned by system owners, because a program that could not allocate the amount of RAM it needed would usually stop its processing dead cold, frustrating the run.

However, as computers were applied to more problem areas, it became clear that it was just not possible to anticipate how much RAM you were going to need. Perhaps in some situations, where the number of applications and their usage patterns were precisely known, it was possible to specify how much RAM you would need at each point and thus have enough of it in the machine.

With the advent of the workstation and the multiuser machine this became impossible. Since everybody could use the machine at any time with any mix of applications, it was no longer possible to delicately administer the RAM so that every application could run smoothly and not stop because of lack of memory.

Operating system and CPU designers realized that it was not necessary for a machine to actually have all the memory demanded by the applications being executed. It was enough to appear to have it, and to provide on demand the regions of memory that were in use at any point. This also relieved system owners of the need for manual intervention and administration of machine resources. The machine could just use the fast RAM until it was exhausted, and at that point start to use the slow disk-based memory as applications demanded more. Of course, this came at the price of performance, but the reasoning was that it was better to have a slow-running application than no application running at all. Thus, virtual memory was born.

Around those basic concepts, the technology has of course evolved to the point where virtual memory has become a mix of art and science, and every operating system in popular use contains a fairly sophisticated implementation of it. Jargon like "committed", "dirty", "shared", "working set" and other words is abundant in the literature, and I confess to knowing only their basic meaning, enough to get the work done. But it's a fantastic facility, because it basically allows you to use your machine as if it had no limits on its physical memory and still get reasonable performance out of it. It's a fascinating dance to watch how the machine allocates and releases virtual memory as you open and close applications or even perform certain tasks with them.

Ok, I hear you ask, what is the "swap file" then? Generically, the place on disk where the operating system keeps the portions of memory that it is pretending are actual RAM is called the "swap file". In some operating systems this is not actually a file but a dedicated disk partition, or even a different physical disk.

Why swapping is bad for your server


Of course, this virtual memory thing is all well and good, but it adds an element of uncertainty to the run time of a process, something that is sometimes undesirable. In fact, real-time or close to real-time systems usually forbid virtual memory altogether in order to keep their commitment to execution times. And virtual memory only works up to a point: when your machine reaches the stage where too many processes are competing for the scarce RAM, it spends more time bringing memory pages in from the disk than actually doing any useful application work. This is what is called thrashing, and it is bad, very bad for performance.

Workstations and servers - two different worlds


All the above applies very well to your workstation or PC, as you usually use it without any predefined pattern or restrictions. You start your word processor now and need some numbers from your spreadsheet, then decide to put it all together in some slides. You launch programs and close them without really being aware of how much RAM you're taking up. And you're willing to put up with a small performance loss as your system starts swapping (which is the word used to denote the point where your machine runs out of actual physical RAM and starts using the disk as a slow replacement), as the convenience surely beats having to organize and manage the execution order of your applications yourself. As long as the illusion works, you're living in a fantastic world of almost unlimited memory.

Life on a server is very different. There are some characteristics that make a machine what we call a "server". For one thing, it does not usually sit on your desk; it lives in a conditioned room with redundant power, protected from excessive heat or cold, and access to it is restricted by a locked door for which only a selected few have the key. Its components are usually rated higher in reliability than their desktop workstation counterparts, even if they probably don't have cutting-edge performance.

All that is for a good reason: a server usually performs some specific function on which the work of many people depends. There are file servers, application servers, database servers, authentication servers, proxy servers, directory servers, name servers, and so on. Sometimes the same machine performs more than one role, especially where there are no resource usage conflicts between the services. But all of them are usually performing some essential function, and either some other system or a business process depends on their availability.

A server has to be available and usually has to deliver reasonable response times. A server runs a fixed set of processes or services, and usually does not need to deal with unexpected situations like someone sending you a 120MB PowerPoint presentation that you need to print in colour 12 times.

The corollary is that your server's memory usage should remain more or less constant. Except in situations where the workload exceeds the planned capacity, your server should be happily working and its resources (CPU, memory, disk) should be used fully. The essential, unforgettable word here is "fully". Fully means completely, but it does not mean "in excess".

And "in excess" when talking about memory means swapping.

Memory usage is the key


Memory is the fastest resource available to the CPU. So don't waste it. Check the documentation that comes with your database. The most mature ones have configuration parameters that allow you to tune the memory usage of the database precisely, and be aware that some of them (for example, MySQL) have different parameters for different storage engines. So your memory usage with MyISAM tables is radically different from your usage with InnoDB tables (by the way, try not to mix the two in the same instance). As always with database tuning: measure, tweak a single parameter, measure again, and continue until you reach a good enough state.
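As a rough illustration (these are standard MySQL statements, but treat the parameter names as a starting point for reading the manual rather than as a tuning recipe), you can inspect the memory-related settings of each storage engine straight from the SQL prompt:

-- MyISAM: the key buffer caches index blocks only; data pages rely on the OS file cache
SHOW VARIABLES LIKE 'key_buffer_size';

-- InnoDB: the buffer pool caches both data and index pages
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Before growing the buffer pool, check how much of it is actually in use
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';

The point is not the specific numbers, but knowing which memory pools your storage engine actually uses before you touch anything.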

But above all, don't allow the database to swap. You'll find documentation saying that having a swap file or partition of at least double the RAM size is a good thing. That does not mean you should use that swap space. It's there just in case, not for regular use.

How much memory should I use?


You may hear from other people that if you're not using the swap file you're not fully using your memory. They are wrong; it's a common misconception that comes from the confusion between physical and virtual memory. You may hear that if your system is not using 100% of physical RAM you're not doing it right. You may hear that if your database is using more than 50% of physical RAM you're not doing it right.

Those are the kind of "myths" that I hate. It all depends on the database server, and even on the storage engine used in the case of MySQL. There is no "right" answer that applies universally.

Oh, yes, there is in fact one universal rule. Repeat after me: don't swap.

Thursday 10 July 2008

Procedural programming and SQL don't mix well

Some of the biggest performance problems I've seen came from perfectly reasonable decisions made by fairly intelligent people. Especially good developers coming from procedural programming languages, trying to adapt the paradigms they know to the SQL language. I think there's a strong parallel between these cases and one of the most cited problems with people new to object-oriented programming: at first, they program in the OO language but are not really using objects, creating instead one big class comprising 90% of the application functionality that essentially contains a well structured procedural program.

Data sets versus control flow

At its heart, SQL is a set-oriented language. Its main constructs, and the philosophy behind them, assume that you're manipulating sets of rows. Years of research have been invested so that you are abstracted as far as possible from how the database actually does that, and so that it does it in the most efficient way. You just write the columns that you want, the relationships between your tables and the conditions you want to meet, and off it goes. Behind that there are sophisticated tuning engines, monitors, clusters, load balancers and god knows what else, all looking to give you the results in the fastest possible way without you having to know how they are actually obtained. Years of research and tons of money have been thrown at the problem of how to do that in the most efficient and safe way.

In sharp contrast we have the classic procedural languages. Basically, those are a set of constructs (data types, structures, control flow) that allow you to specify exactly the instructions you want the machine to execute. Notice that a procedural programming language is not concerned with why you want to do this or that, much less with the correctness of your steps. It just wants a set of instructions to execute on a set of data. Over the course of time, and thanks to some very bright people, procedural languages have been improving, trying to make the job of the programmer easier, mostly by abstracting, encapsulating and hiding the parts and details of the problem not relevant to your main concern.

Those are very different concepts. In one corner, a language designed only for data manipulation in the form of sets of rows with columns and relations amongst them. In the other, a set of primitive instructions that, like Lego blocks, allow you to do anything.

Needless to say, procedural languages are very powerful. They allow you, depending on the level at which they operate, to specify all the details of what you want the machine to do. They are the lowest level of abstraction, and with that comes the price of complexity. A domain-specific language, in contrast, sacrifices flexibility and applicability, giving you in exchange much more productivity as well as a view of the world that more closely matches the problem you're trying to solve.

SQL is a domain-specific language, its domain being the manipulation of sets of rows with columns that have relationships amongst them. It happens to fit a lot of scenarios, especially business automation problems where machines are replacing and augmenting manual, well specified processes. But SQL sacrifices detail, from control over the concrete operation of the database to the ability to perform seemingly simple tasks. How do you open a window in SQL? How do you print a report in SQL? The answer is, you don't. Well, that's not strictly true: some extensions of SQL allow you to do such things, but they are not really SQL. They are hacks built into the language so that managers and programmers can convince themselves that they don't need to learn another language.

In fact, they are really learning to fit two paradigms in the same language, but marketing is calling it the same as what they already know so that it's easier to sell them the whole thing.

Some examples


The following are real-world examples of these types of problems. I've found them everywhere, from SAP to custom-built J2EE applications. Bear in mind that these are general rules, and you'll sometimes have a valid exception to them. But yours is unlikely to be one the first, second or third time you encounter these situations. Exceptions to these rules are very, very uncommon.

Loops versus single SQL statements

Ignoring transaction isolation levels for a moment, you will agree that the following pieces of code are equivalent:

FOREACH R IN (SELECT * FROM T WHERE T.B = 2)
UPDATE T SET T.A = 1 WHERE T.ID = R.ID; -- assuming ID is the primary key of T
LOOP

UPDATE T
SET T.A = 1
WHERE T.B = 2;

And yes, they do the same thing. But the seasoned procedural programmer delving into SQL tends to write the first version, because of the feeling of having complete control. But, and this should not come as a surprise, the second version usually performs much better (or at worst on par with the first one), and it is shorter and easier to maintain. This, taken to the extreme, makes people write:

A = MIN_INTEGER;
FOREACH R IN (SELECT VA FROM T)
IF R.VA > A THEN A = R.VA;
LOOP;

instead of

SELECT MAX(VA) INTO A FROM T;

and this time, believe me, for enough rows in T the performance difference is spectacular. Of course, in favor of the SQL statement.

Function calls in WHERE clauses

The following two pieces of code are also equivalent (at least for non-negative values of R.A):

SELECT * FROM R WHERE SQRT(R.A) = 2;

SELECT * FROM R WHERE R.A = 4;


But as you may have guessed from the heading, the second is usually much faster, because the database can use indexes to look up the values of R.A directly. Oh, yes, the latest version of your favorite database has function-based indexes, I hear you say. Well, what if you need to look up values based on the results of three different functions? Are you going to create an index for each and every different operation you perform? Function-based indexes are one of those features that can save you from disaster if you're one of the select few that actually needs them, but they are not useful in the general case.
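For reference, this is roughly what a function-based index looks like in Oracle-style syntax (the index name is made up for the example):

CREATE INDEX r_sqrt_a_idx ON R (SQRT(A));

-- With the index in place, the optimizer can resolve this predicate without a full scan:
SELECT * FROM R WHERE SQRT(R.A) = 2;

But again, you need one such index per expression you search on, which is exactly the maintenance burden described above.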

Function calls versus in-line calculations

This one has already been mentioned in this blog. It is very nice to have a function like getCustomerCredit(CustID) that calculates the credit rating for a single customer based on its order history. That way, the procedural code that sets up the screen display for order entry can just do lCustCreditLabel.Value = getCustomerCredit(custID). But when the procedural programmer is tasked with creating a printed report of customer ratings, the first version is always:

SELECT custID, custName, getCustomerCredit(custId)
FROM customers....

Probably getCustomerCredit looks something like this:

SELECT customers.creditLimit - SUM(orders.OrderValue)
FROM customers, orders
WHERE customers.custId = :custId AND orders.custId = customers.custId AND orders.stat = 'OUTSTANDING'
GROUP BY customers.creditLimit;

Instead, the following alternative will be far, far more efficient

SELECT customers.custId, customers.custName, customers.creditLimit - SUM(orders.OrderValue)
FROM customers, orders
WHERE customers.custId = orders.custId AND orders.stat = 'OUTSTANDING'
GROUP BY customers.custId, customers.custName, customers.creditLimit;

Replacing built-in SQL functionality

This is probably a repeat of all the previous ones together. When something can be done in raw SQL, it is almost always faster and better to do it in raw SQL. Fancy frameworks may hide some of the complexity, but they will make you pay a price. And that price is usually performance.

You can have some structure in your SQL


As you have seen in the examples above, the more efficient alternatives break encapsulation, abstraction, detail hiding and probably some other principles of structured programming in one way or another. After you've worked long and hard to encapsulate the business rules in a way that avoids repetition and hides complexity, I'm basically telling you to throw that away in favor of better performance. I don't have an answer to this dilemma, and I think that nobody has, possibly because SQL was not designed to abstract business rules or data structures beyond tuples (rows) with attributes (columns).

That's not to say that you cannot have any kind of structure in your SQL. Quite the opposite: you should strive to have it. There's nothing worse or more unmaintainable than an application that hand-crafts SQL statements and sends them to the database to execute for each little thing it does. But always keep in mind that abstractions are useful as long as they pay off. Don't build abstractions that break the assumptions made in the design of the language or the development of the database engine.
For example, it's very common and good practice to abstract business rules (data manipulation and validations) in stored procedures.
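As a hypothetical sketch (MySQL-style syntax; the table, column and procedure names are invented for the example), the credit calculation from earlier could be wrapped in a stored procedure that every client application calls, instead of each one repeating the SQL:

-- delimiter handling for the mysql client omitted for brevity
CREATE PROCEDURE get_customer_credit (IN p_custId INT, OUT p_credit DECIMAL(12,2))
BEGIN
  -- available credit = credit limit minus the value of outstanding orders
  SELECT c.creditLimit - COALESCE(SUM(o.OrderValue), 0)
    INTO p_credit
    FROM customers c
    LEFT JOIN orders o ON o.custId = c.custId AND o.stat = 'OUTSTANDING'
   WHERE c.custId = p_custId
   GROUP BY c.creditLimit;
END

The rule stays inside the database, close to the data, and the set-oriented version of the report can still bypass it when performance demands it.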

I'm not advocating cut-and-paste programming, or saying that you should not try to be as structured as possible in the coding and design of your applications. But you really have to be aware of the trade-offs and, above all, don't try to reinvent the wheel. The designers of your RDBMS probably spent a long time thinking about how to make their engine efficient and easy to use. Don't fight them; take the red pill and do things their way. It's better in the long term.

Thursday 3 July 2008

Maintenance programmer and performance

It's almost universal. And unavoidable. Yet I'm not sure if it's right. But it's true. At almost any organization whose goal is to develop and maintain software, there is a knowledge-based hierarchy. The best and brightest individuals tend to take care of the more complex programming and architecture issues. The juniors and newcomers are assigned the lowest value-added tasks. Usually, organizations tend to offer career options and other incentives to their best people, in the hope of retaining them and extracting more value from the time they are in the office. The less experienced, productive or talented people are invariably relegated to jobs that are perceived as adding less value.

Agile/Extreme Programming methodologies try to do away with this segregation of categories by using "pair programming" and making everyone accountable for all the code in the system, but in practice this is difficult to achieve. It needs a more or less homogeneous team in terms of experience and capability, as without an equal footing "pair programming" ends up morphing into some sort of mentoring scheme. In that scheme, pairs are allocated so that there's always a "master" on the subject working together with someone less experienced. In an ideal world they would become true pairs given enough time, but time is money for a business that wants results here, now and cheap. Nonetheless, and although I don't have field experience with it, I think that pair programming properly done can be one of the greatest assets of a programming team.

It is, after all, in the business's best interest to maximize the return on every penny it puts at risk (read: invests in a software package), but this hierarchical scheme usually does have an impact on the way software maintenance is done, because the best people are usually assigned to the "best problems", and maintenance is often written off as a necessary evil.

I've never understood that point of view. All the literature on business software development sends the same message about emphasizing readable and clear code, based on field data showing that maintenance is often half of the total cost of any application, if not more. It should follow logically that, since maintenance costs are high, maintenance taken as its own profit centre should be given the importance it deserves. But in fact that's not often done.

Part of the reason, I suppose, stems from the wide scope of the word "maintenance". From simple changes in user interface or report layout to whole data model changes, everything often falls into the same "maintenance" bucket. While the profile and capabilities needed for one task versus another are obviously very different, the fact is that the same resources are usually assigned to both kinds of task.

And that invariably impacts performance. I'm sure that even strict adherence to the most rigid change management methodology will still leave business systems exposed to unexpected performance problems, if only because assessing the performance impact of changes is still an unexplored area.

Add a bit here, add a bit there.


Let's tell a story that illustrates this very well. Bob is an entry-level developer assigned to minor enhancements in the company ERP system. One day, Bob receives a call from a change manager who has already completed the immense paperwork required to add a simple data point to a data entry screen: when entering a customer order, the system has to display the outstanding value of all orders pending delivery to that customer.

While Bob is relatively new to programming, he is able to complete the task and creates a simple subroutine (or method, or user exit) that computes the outstanding value of all orders for an individual customer. The change is tested, moved to live, and for Bob it's all well and good.

Months later, an internal control manager who is part of the internal audit team needs to act on a point raised in the last audit about the excessive number of customer orders entered that exceed the agreed per-customer order volume. He calls for a change so that the data entry clerk can see not only the outstanding amount but also the outstanding volumes.

Bob receives the request and changes the subroutine (or method, or user exit) to display that information on screen. It makes sense to do it at the same time the order value is calculated, since they are closely related. The change is tested, moved to live, and for Bob it's all well and good. So good, in fact, that he's asked to create a report listing all that information, since the customer service manager will start his daily operations meeting with that information on the desk for the team to arrange and prepare shipments.

Being smart, Bob creates a report that calls his function once for each customer in the customer master, checking for the outstanding order volume and adding it to the report if there is any. The report runs a bit slow, but that's not a problem because the report is run overnight. The change is tested, moved, and everything is good.
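In pseudocode similar to the earlier examples (the table, column and function names are invented for illustration), Bob's report does something like this:

FOREACH C IN (SELECT custId, custName FROM customers)
V = getOutstandingOrderVolume(C.custId); -- hypothetical helper: one query against the orders table per customer
IF V > 0 THEN PRINT_LINE(C.custName, V);
LOOP;

One pass over the orders table per customer is tolerable for an overnight report, which is why nobody worries about it yet.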

Bob moves on to other projects, and eventually leaves the company. He's replaced by Sean, a person as talented and motivated as Bob, who later receives a request to change the report. As the customer service manager focuses his daily meeting on the activities planned for the next 24 hours, he needs the outstanding order volume split by delivery date, so as not to waste time reviewing something that is not an immediate concern.

Sean examines the report and, without hesitation, changes Bob's function so that it breaks the totals down by delivery date and adds them up at the end. Of course, the function is even slower and the report more so, but for Sean it is the most logical way to complete the work, since it involves the least amount of change and effort. Also, the report now gives 24 times more information, so it's reasonable that it takes more time. Oh, and besides, the report is run overnight, so performance should not be too big a concern...

Sean's change is tested, moved to live and all is good... well, not quite. Now the whole data entry team is screaming because they cannot enter orders. The system has become inexplicably slow, and it now takes ages to do what in the past was acceptable, if not exactly snappy. Of course, the explanation is that the same program logic is being used in two very different contexts (a nightly batch report and live data entry) with very different requirements. Each time the program was changed it made perfect sense in that concrete context, yet the cumulative changes led to a problem.

I've seen this pattern happen many times in the real world (in fact, this story is true except for the people's names). The problem would have been avoided if close testing of the whole system had been done, or if close inspection of the code had revealed its usage in multiple places, and the potentially conflicting uses of it.

Of course, had Sean or Bob been mentored, they would have included some comments or documentation somewhere. Had complete performance testing of the whole system been done, the problem would have been detected. But who does that for a seemingly innocent change in some report output, or for an information-only label on a data entry screen?

The irony is that, when the problem is detected, poor Sean is blamed for his poor coding practices, while he had few choices in how to proceed with the change. Had he created another procedure, he would have been seen as a "copy & paste" programmer, not taking advantage of functionality already there and investing more time than necessary in the change. Had he said "look, it's very inefficient to loop over all customers just to find the ones with outstanding orders, let's do this the other way around as it will be more efficient", he would have been told that performance was not that important. Had he spotted the actual problem of reusing code that was also used in an interactive data entry screen, he would have been told that he was not assigned to that part of the code.

The curse of the maintenance programmer


All of the above conspires against the maintenance developer. Regardless of their experience and skill level, they will always be caught in the middle of discussions. The IS side will argue about costs and resource limitations, and the business side will argue about costs and deadlines.
Plus, an existing application forms part of a closely interwoven web of processes, knowledge and business practices. It's not easy to justify any refactoring once an application is live, if only because the benefits are never apparent in the short term.

I've seen very talented developers despair when assigned maintenance roles. Not because of the quality of the challenge or the problems they have to face, as some systems can be more complex to maintain than to develop, given the number of additions made after they go live. No, the reason for their despair is that they have to deal with a codebase inherited not only from the original developers but also from anybody else who has sat in their chairs before them. When they are asked to do anything with the code, and after overcoming the "complete rewrite syndrome", they may even feel comfortable with it. However, when faced with change, they are always compelled by management to take the shortest (read: cheapest) route. As we know from the travelling salesman problem, that is not always the optimal solution.