Wednesday, June 6, 2012

Few thoughts on debugging

If you are a developer who works on distributed systems, there is one thing you learn well: how to debug, and how to avoid having to debug. Following are some of my thoughts and some things I generally do.

Debugging a Local Java Program

  1. If you write your program well, you will generally have a stack trace when you have a problem. (This does not apply to performance problems and memory leaks; I will write a separate note about those.)
  2. Look at the trace and go to where the error happened. Try to figure out what happened. The best way to do this is by walking through the logic again.
  3. If that did not work, copy and paste your stack trace into Google. About 80% of the time, you will find the answer there. Pay special attention to JIRA bug reports for the projects you are using and to online forums like stackoverflow.com.
  4. If that did not work, you will have to debug. You can debug by running the code from your IDE (e.g. Eclipse) or by connecting to a server through a remote debugger; the snippet after this list shows the JVM flags involved. Walking through the logic with the debugger generally tells you what happened.
  5. If none of these worked, it is time to ask someone for help. There are some bugs that are very hard for the author of the code to see. However, you should go for help with the problem recreated, a debugger attached, and ready to let the helper step through the execution.
  6. If you have trouble with a specific tool, you can ask for help on its user lists, its forums, or general developer forums like stackoverflow.com.
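
For example, starting the JVM with the standard JDWP options lets a remote debugger such as Eclipse attach to a running server (the jar name and port below are placeholders; using suspend=y instead makes the JVM wait for the debugger before starting):

    java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -jar myserver.jar

Then point the IDE's remote debug configuration at the server's host and port 5005 and set breakpoints as usual.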

Debugging a Distributed System

  1. Debugging distributed systems is hard, so the best approach is to avoid having to debug them. You can almost get there by writing unit tests and testing what you write in small steps. My preferred approach is to make one path work end to end, and then make small changes, testing each change.
  2. A distributed system has multiple processes (JVMs), which makes it tricky to debug. If it is at all possible, find a way to run your whole distributed system within a single process (JVM). This will need some imagination on your end, but it will save you a lot of trouble later.
  3. If you are debugging a distributed system, it is often useful to capture the messages that are sent and received. You can do this via tools like TCP Monitor, SOAP Monitor, or Wireshark.
  4. It is doubly important to log all exceptions that can happen in your code. Otherwise, you will have no idea whether the system worked or not.
  5. I often append a timestamp and the name of the process or host to each log line. One way to do this is by writing a Log4j appender (see the sketch after this list). The timestamp and process or host name let me merge-sort all the logs into one file and read through the execution of the whole system in one pass.
  6. It is likely that your distributed system processes a lot of messages, so the log becomes very hard to read and understand. One way out of this is to log only every 1000th message. I do this by keeping a message count and using:
    if (messageCount % 1000 == 0) {
       log.info(….);
    }
    
  7. If you are running a complex system that has more than five nodes, you should invest in some mechanism to collect the logs (using something like Flume) and automate their processing to find stack traces etc.
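
As a minimal sketch of point 5, assuming Log4j 1.x: rather than writing a full custom appender, one simple option is to put the host name into the MDC once at startup and reference it from the layout pattern. The class name and pattern below are illustrative, not from any real code base:

    import java.net.InetAddress;
    import org.apache.log4j.MDC;

    public class HostAwareLogging {
        // Call once at process startup, before any log line is written.
        public static void init() throws Exception {
            MDC.put("host", InetAddress.getLocalHost().getHostName());
        }
        // The layout can then print the host on every line, e.g. in
        // log4j.properties:
        //   log4j.appender.FILE.layout.ConversionPattern=%d{ISO8601} [%X{host}] %-5p %c - %m%n
    }

With a timestamp and host on every line, the logs from all nodes can be concatenated and sorted by time to reconstruct the global order of events.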

Tuesday, June 5, 2012

Scaling WSO2 Stratos

WSO2 Stratos is a Platform as a Service (PaaS) that offers middleware technologies like Web Services, Workflows, and Messaging as a service.

PaaS environments bring together many tenants and could potentially attract a very large number of users. Therefore, scaling up is a major consideration for a PaaS. This post explains our experiences and some thoughts on scaling Stratos.

Problem

Stratos is multi-tenanted. In other words, there are many tenants. Each tenant generally represents an organization and is isolated from the other tenants: each tenant has its own users, resources, and permissions. Stratos supports multiple PaaS services, where each PaaS service is actually a WSO2 product (e.g. AS, BPS, ESB) offered as a service. Using those services, tenants may deploy their own web services, mediation logic, workflows, gadgets, etc.
  1. The WSO2 Stratos runtime provides servers, each of which can support multiple tenants and provide a PaaS service. For example, there are multi-tenanted AS, BPS, and ESB servers.
  2. Stratos can provision (add/remove) resources (computing nodes) on demand.
  3. The problem is to build a system that scales up and down without end users noticing it.

Answers

The following describes a series of solutions, where each solution adds a new feature to solve a specific problem. It explains the rationale and thought process behind the final design.

Solution 1
WSO2 Stratos consists of one multi-tenanted server of each type (e.g. ESB, BPS, etc.). Users first talk to an Identity Server (IS), get an SSO (single sign-on) token, and log in to any server. We stored all tenant data in a registry. Each server loads all tenants at startup and can serve any tenant when it receives a request.

Solution 2
Solution 1 does not scale at all. So we started running multiple instances of each server and put a load balancer (LB) in front. The LB distributes requests across the server instances. When the load on the server instances is high, the LB starts new instances, and when the load is low, it shuts some instances down. We call this auto-scaling.
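
A rough sketch of such an auto-scaling loop is below; the Cloud interface, its methods, and the thresholds are invented for illustration and are not the actual Stratos API:

    // Illustrative only: scale up under high load, scale down under low load.
    interface Cloud {
        double averageLoad();   // e.g. in-flight requests per instance, normalized to 0..1
        int instanceCount();
        void startInstance();
        void stopInstance();
    }

    class AutoScaler implements Runnable {
        private final Cloud cloud;

        AutoScaler(Cloud cloud) { this.cloud = cloud; }

        // Scheduled to run periodically, e.g. once a minute.
        public void run() {
            double load = cloud.averageLoad();
            if (load > 0.8) {
                cloud.startInstance();                 // high load: add a node
            } else if (load < 0.2 && cloud.instanceCount() > 1) {
                cloud.stopInstance();                  // low load: remove a node, keep one
            }
        }
    }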

Solution 3
By the time Stratos had several hundred tenants, many of them with tens of services, it took a long time to load all tenants at startup: 15-30 minutes. Furthermore, most tenants stay inactive most of the time. However, since each node has to hold all tenants, Stratos spends resources on inactive tenants as well.

To avoid the above problems, solution 3 added lazy loading. All information about tenants is stored in a central registry, and tenants are loaded into memory only when they are needed. A tenant gets unloaded when it has been idle for more than a given timeout. You can find more information about lazy loading in Azzez’s blog entry “Lazy Loading Deployment Artifacts in a PaaS Deployment”.
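
A hedged sketch of the idea, with all names (TenantCache, loadFromRegistry, the timeout value) invented for illustration:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Tenants are loaded from the registry on first access and unloaded
    // after an idle timeout; this is the essence of lazy loading.
    class TenantCache {
        static class TenantContext {
            volatile long lastAccess = System.currentTimeMillis();
            // ... configuration, deployed artifacts, etc.
        }

        private static final long IDLE_TIMEOUT_MS = 30 * 60 * 1000L;
        private final Map<String, TenantContext> loaded =
                new ConcurrentHashMap<String, TenantContext>();

        TenantContext get(String tenantDomain) {
            TenantContext ctx = loaded.get(tenantDomain);
            if (ctx == null) {
                ctx = loadFromRegistry(tenantDomain);  // expensive; done only on demand
                loaded.put(tenantDomain, ctx);
            }
            ctx.lastAccess = System.currentTimeMillis();
            return ctx;
        }

        // Run periodically from a background thread.
        void unloadIdleTenants() {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, TenantContext> e : loaded.entrySet()) {
                if (now - e.getValue().lastAccess > IDLE_TIMEOUT_MS) {
                    loaded.remove(e.getKey());
                }
            }
        }

        private TenantContext loadFromRegistry(String tenantDomain) {
            return new TenantContext();  // placeholder for the real registry read
        }
    }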

Solution 4
When a tenant has several artifacts, loading them takes time. So, in solution 3, if a tenant is accessed while it has not yet been loaded, the first request or two to that tenant will time out.

Solution 4 added a ghost deployer to solve this problem. The ghost deployer does not load all the information about a tenant; it loads just the metadata, and the actual artifacts are loaded on demand. As a result, loading a tenant became much cheaper in solution 4, which keeps requests from timing out while the tenant is being loaded. You can also find more information about this in Azzez’s blog entry “Lazy Loading Deployment Artifacts in a PaaS Deployment”.
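
The gist of the ghost idea, again as an illustrative sketch rather than the actual Carbon implementation:

    // Only cheap metadata is read at startup; the heavy artifact is
    // deployed lazily on the first request that needs it.
    class GhostService {
        private final String serviceName;    // lightweight metadata
        private volatile Object realService; // heavyweight artifact

        GhostService(String serviceName) { this.serviceName = serviceName; }

        Object resolve() {
            if (realService == null) {
                synchronized (this) {
                    if (realService == null) {
                        realService = deployFromArtifact(serviceName);
                    }
                }
            }
            return realService;
        }

        private Object deployFromArtifact(String name) {
            return new Object(); // placeholder for the actual deployment work
        }
    }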

Solution 5
However, in solution 4, a single LB does not scale to handle a large number of requests. So we replaced the LB with multiple LBs that hold the same metadata on every node. Therefore, all LBs take the same decision when they receive a request. We can use this model to scale Stratos by placing a hardware load balancer in front, or by setting up DNS round-robin, to distribute requests among the LBs.

To synchronize the metadata across all LBs, we could use group communication. However, that is M-to-M (all-to-all) communication, which is heavy. Instead, the LBs in Stratos are designed to send updates as batches to a single decision service, and the decision service takes the auto-scaling decisions. We provide high availability by running a replica of the decision service and keeping it up to date via state replication through group communication.
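
For illustration only (this is not the Stratos wire protocol), each LB could accumulate request counts locally and flush them to the decision service on a timer, so that N LBs talk to one service instead of to each other:

    import java.util.HashMap;
    import java.util.Map;

    interface DecisionService {
        void report(Map<String, Integer> requestCountsByService);
    }

    class LoadReporter {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        synchronized void record(String service) {
            Integer c = counts.get(service);
            counts.put(service, c == null ? 1 : c + 1);
        }

        // Called every few seconds by a timer thread; one batched call
        // replaces thousands of per-request updates.
        synchronized void flush(DecisionService decisionService) {
            if (!counts.isEmpty()) {
                decisionService.report(new HashMap<String, Integer>(counts));
                counts.clear();
            }
        }
    }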

Solution 6
However, in solution 5, the LB instances are not aware of tenants. Thanks to lazy loading, all requests will still work even though the LBs route messages arbitrarily. However, this might lead to a scenario where a single node has to load too many tenants.

To avoid this, in solution 6, the LBs are aware of tenants, and only a subset of the tenants is allocated to each LB. You can find more information in Sajeewa’s blog entry, WSO2 Tenant Aware Load balancer.
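
A simplified sketch of tenant-aware routing; the plain hash partitioning below is just for illustration and is not the allocation logic the Stratos LB actually uses:

    import java.util.List;

    // Each tenant maps to one fixed cluster of back-end nodes, so a
    // tenant's artifacts end up loaded on only a few nodes.
    class TenantAwareRouter {
        private final List<List<String>> clusters; // each inner list = node URLs of one cluster

        TenantAwareRouter(List<List<String>> clusters) { this.clusters = clusters; }

        String pickNode(String tenantDomain, int requestId) {
            // The same tenant always lands on the same cluster ...
            int c = (tenantDomain.hashCode() & 0x7fffffff) % clusters.size();
            List<String> cluster = clusters.get(c);
            // ... and requests are round-robined within that cluster.
            return cluster.get(requestId % cluster.size());
        }
    }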

Solution 7
The upcoming Stratos release will follow solution 6. However, the next potential problem is that the registry, which holds the configurations and resources of all tenants, might not scale to handle a very large number of tenants. Hence, the registry needs to be partitioned across multiple servers.