Problem to Solve – My company’s most critical application runs over the IBM MQ protocol. When we have performance issues, they are very difficult to troubleshoot. What can I do?
GUEST BLOGGER – Mark Fink, Network Engineer with 20+ years of experience
IBM WebSphere MQ is a widely used messaging middleware system that supports large-scale, high-volume data processing. There are many reasons to use MQ, but chief among them are the scalability and protection it affords you. These matter most when you cannot foresee how much demand and growth your application will face and that demand and growth could be large, as with a public-facing Web application. Obamacare could’ve used a little MQ perhaps! MQ handles a huge transaction load securely and reliably (components can fail and your data is still protected), and you simply add processing capacity as your needs expand. It’s a beautiful thing to a developer.
I work with companies daily, helping them triage and troubleshoot exactly the sort of large-scale applications that call for MQ. A claims processing application for a large insurance company is an excellent example. They have transactions hitting the system from agents all over the US and from customers on their Web site. MQ sits between the different front ends and the database servers, ensuring data integrity across the board. And when something goes wrong, it means big money and lots of heartburn for the engineers trying to determine whether the issue is actually with MQ. MQ is complex, and troubleshooting it requires MQ-specific knowledge that is generally the domain of a specialized MQ team. It is a mystery to almost everyone else.
However, because MQ is part of so many business-critical applications, we must be able to quickly triage a problem to MQ (or away from it). And when we do triage a problem to MQ, we need to give the MQ admins enough detail to isolate the root cause and resolve it quickly. MQ can’t just be a mysterious black box. At the end of the day, MQ is by its very nature only one piece of a larger system. When MQ is this hard to troubleshoot on its own, it can add a lot of time to resolving issues with that larger system, and that system is presumably mission critical, or you would not be using MQ in the first place.
MQ Intelligence is Essential
My own opinion is that a monitoring tool with MQ intelligence is essential. You can try to analyze MQ packet traces yourself. More power to you. But you won’t get far, at least not quickly. Here are the challenges:
- The specs are not public; IBM keeps them close to the vest. Vendors pay IBM for access to those specs, and they have to recoup that cost.
- For the above reason, you won’t find freeware that provides what you need for MQ. There’s been some reverse engineering done, but the end result is far from complete and there is no guarantee it will remain current.
- MQ is complex. There is one interface between MQ and the Web, database, and other servers that send and receive messages through it, and a different interface between queue managers (the MQ servers). For both, there are multiple access methods, ranging from IBM’s native MQI to HTTP to Java. Manually decoding MQ requires knowledge of all of this, along with good protocol decodes. In my opinion, that is more than we can reasonably expect from anyone other than MQ specialists.
- IBM’s own tools for troubleshooting MQ leave much to be desired. The main shortcoming is that they are not designed to run continuously (they are reactive), and they are not designed for a front-line service desk. So you will be very far into the triage and troubleshooting process, with probably a lot of heartburn and stress, before these tools are brought to bear. Even then, they require that you have a pretty good idea of what you’re looking for. And they are the domain of your MQ specialists, so their involvement is implied.
- APM tools that use MQ server agents are blind to the network. That is, while they provide excellent MQ-specific data, they do not help with the part of the resolution process that consumes the most time: triage. These tools are great once you know the problem is with MQ. Before that, you’re just searching for something that may or may not be there (and when it isn’t there, you bear the resentment of folks who feel their time has been wasted). These tools also still require expertise and involve your MQ specialists. They don’t save time.
Best Place to Get the MQ Intelligence?
A packet-based product with MQ intelligence is essential because the network is the only vantage point from which you can analyze all layers of the OSI stack at once. This approach also avoids any impact on the MQ servers themselves, while rolling the data up into one dashboard that enables fast triage of network versus server versus MQ-specific issues. Packet-based products can break data out by MQ queue (which gets you a long way on its own), and they can show you any MQ-level errors traversing those queues. You then need only know your queue names and which queues are associated with your critical systems and applications, and let the tool tell you the rest.
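To make the "break data out by queue" idea concrete, here is a minimal Python sketch of the kind of per-queue rollup such a dashboard might compute. It assumes a packet-based tool has already decoded each MQ API call into a record; the MqCall record shape, the summarize_by_queue helper, and the queue names CLAIMS.REQUEST and CLAIMS.REPLY are all hypothetical, for illustration only. The only MQ-defined values used are the completion codes (MQCC_OK is 0) and reason code 2053 (MQRC_Q_FULL).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# MQ completion codes: MQCC_OK = 0, MQCC_WARNING = 1, MQCC_FAILED = 2.
MQCC_OK = 0


@dataclass
class MqCall:
    """One decoded MQ API call as a packet-based tool might report it (hypothetical shape)."""
    queue: str            # queue name the call targeted
    verb: str             # e.g. "MQPUT", "MQGET"
    completion_code: int  # MQCC_* value returned in the reply
    reason_code: int      # MQRC_* value returned in the reply
    response_ms: float    # request-to-reply time observed on the wire


@dataclass
class QueueSummary:
    calls: int = 0
    errors: int = 0
    total_ms: float = 0.0
    max_ms: float = 0.0

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.calls if self.calls else 0.0


def summarize_by_queue(calls: List[MqCall]) -> Dict[str, QueueSummary]:
    """Roll decoded calls up per queue: call count, non-OK count, response times."""
    summary: Dict[str, QueueSummary] = defaultdict(QueueSummary)
    for call in calls:
        s = summary[call.queue]
        s.calls += 1
        s.total_ms += call.response_ms
        s.max_ms = max(s.max_ms, call.response_ms)
        if call.completion_code != MQCC_OK:
            s.errors += 1
    return dict(summary)


calls = [
    MqCall("CLAIMS.REQUEST", "MQPUT", 0, 0, 4.0),
    MqCall("CLAIMS.REQUEST", "MQPUT", 2, 2053, 6.0),  # 2053 = MQRC_Q_FULL
    MqCall("CLAIMS.REPLY", "MQGET", 0, 0, 12.0),
]
result = summarize_by_queue(calls)
for queue, s in sorted(result.items()):
    print(f"{queue}: {s.calls} calls, {s.errors} errors, "
          f"avg {s.avg_ms:.1f} ms, max {s.max_ms:.1f} ms")
```

Knowing which of these queues back your critical applications is what turns the rollup into a triage view.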
Common MQ Issues
There are a variety of common MQ issues a packet-based product can see and show you:
- Errors on MQGETs and MQPUTs leading to failed transactions.
- Hung queues that stop delivering messages altogether.
- Long delays on message transmission due to MQ bugs! There have been several over the years. You upgrade to a new MQ version or patch, and all of a sudden there’s a 20-second delay on MQDISC calls to the servers. It was easy to blame the network on that one; it took a while to prove it was MQ.
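Each of these three issues can be expressed as a simple check over decoded MQ calls. The Python sketch below is illustrative only. The MqCall record, the helper names, the 60-second stall window, and the 5-second MQDISC threshold are hypothetical assumptions, not any product's logic. One MQ detail it does rely on: an MQGET against an empty queue returns MQCC_FAILED with reason 2033 (MQRC_NO_MSG_AVAILABLE), which is routine and should not be counted as an error. The hung-queue check also assumes the consuming application’s MQGETs actually cross the wire (a client connection); a consumer bound locally on the MQ server would be invisible to a packet capture.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# MQI constants. MQGET returns MQCC_FAILED with MQRC_NO_MSG_AVAILABLE
# when the queue is simply empty, which is normal, not a fault.
MQCC_FAILED = 2
MQRC_NO_MSG_AVAILABLE = 2033


@dataclass
class MqCall:
    """One decoded MQ API call (hypothetical record shape)."""
    timestamp: float      # seconds since the start of the capture
    queue: Optional[str]  # None for connection-level calls such as MQDISC
    verb: str             # "MQPUT", "MQGET", "MQDISC", ...
    completion_code: int  # MQCC_* value from the reply
    reason_code: int      # MQRC_* value from the reply
    response_ms: float    # request-to-reply time seen on the wire


def failed_puts_and_gets(calls: List[MqCall]) -> List[MqCall]:
    """Failed MQPUT/MQGET calls, skipping the routine 'queue empty' MQGET result."""
    return [
        c for c in calls
        if c.verb in ("MQPUT", "MQGET")
        and c.completion_code == MQCC_FAILED
        and not (c.verb == "MQGET" and c.reason_code == MQRC_NO_MSG_AVAILABLE)
    ]


def hung_queues(calls: List[MqCall], stall_seconds: float = 60.0) -> List[str]:
    """Queues still accepting MQPUTs with no successful MQGET for stall_seconds."""
    first_put: Dict[str, float] = {}
    last_put: Dict[str, float] = {}
    last_get: Dict[str, float] = {}
    for c in calls:
        if c.queue is None or c.completion_code == MQCC_FAILED:
            continue  # only queue-level calls that did not fail count here
        if c.verb == "MQPUT":
            first_put[c.queue] = min(first_put.get(c.queue, c.timestamp), c.timestamp)
            last_put[c.queue] = max(last_put.get(c.queue, c.timestamp), c.timestamp)
        elif c.verb == "MQGET":
            last_get[c.queue] = max(last_get.get(c.queue, c.timestamp), c.timestamp)
    hung = []
    for queue, put_ts in last_put.items():
        # Measure the stall from the last successful get, or the first put if none was seen.
        since = last_get.get(queue, first_put[queue])
        if put_ts - since >= stall_seconds:
            hung.append(queue)
    return sorted(hung)


def slow_disconnects(calls: List[MqCall], threshold_ms: float = 5000.0) -> List[MqCall]:
    """MQDISC calls whose reply took at least threshold_ms."""
    return [c for c in calls if c.verb == "MQDISC" and c.response_ms >= threshold_ms]


calls = [
    MqCall(0.0, "CLAIMS.REQUEST", "MQPUT", 0, 0, 3.0),
    MqCall(1.0, "CLAIMS.REQUEST", "MQGET", 0, 0, 2.0),
    MqCall(2.0, "CLAIMS.REPLY", "MQGET", 2, 2033, 1.0),    # empty queue: not an error
    MqCall(3.0, "CLAIMS.REQUEST", "MQPUT", 2, 2053, 1.0),  # 2053 = MQRC_Q_FULL
    MqCall(90.0, "CLAIMS.REQUEST", "MQPUT", 0, 0, 3.0),
    MqCall(95.0, None, "MQDISC", 0, 0, 20000.0),           # a 20-second MQDISC
]
failed = failed_puts_and_gets(calls)
hung = hung_queues(calls)
slow = slow_disconnects(calls)
print(len(failed), hung, len(slow))
```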
Even if the tool doesn’t go all the way and identify the absolute root cause of an MQ issue, triage is the larger point. Accurate, fast triage spares 50 people from sitting in a virtual war room guessing about the problem for 5 hours. You know whether it is a network problem or an MQ problem. Then you kick it over to the appropriate team with enough detail that they can resolve it without guessing and finger-pointing. You involve MQ specialists only if they need to be involved, which is exactly what they want! Everyone wins!
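The network-versus-MQ split can be reduced to a rough rule of thumb: if TCP round-trip time or retransmissions are out of line, look at the network first; if the network looks clean but MQ replies are slow or failing, the time is being spent on the server or in MQ. Below is a minimal Python sketch of that rule. The Observation fields, the approximation of server time as MQ response time minus one network round trip, and every threshold are hypothetical assumptions for illustration, not settings from any particular product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Aggregated measurements for one application path (hypothetical fields)."""
    tcp_rtt_ms: float      # network round-trip time seen in TCP traffic
    retransmit_pct: float  # percent of TCP segments retransmitted
    mq_response_ms: float  # average MQ request-to-reply time on the wire
    mq_error_pct: float    # percent of MQ calls that returned MQCC_FAILED


def triage(obs: Observation,
           max_rtt_ms: float = 50.0,
           max_retransmit_pct: float = 1.0,
           max_server_ms: float = 200.0,
           max_error_pct: float = 1.0) -> List[str]:
    """Return which team(s) to engage first; all thresholds are illustrative."""
    findings = []
    if obs.tcp_rtt_ms > max_rtt_ms or obs.retransmit_pct > max_retransmit_pct:
        findings.append("network")
    # Rough server-side time: MQ response time minus one network round trip.
    server_ms = max(obs.mq_response_ms - obs.tcp_rtt_ms, 0.0)
    if server_ms > max_server_ms or obs.mq_error_pct > max_error_pct:
        findings.append("MQ/server")
    return findings or ["no clear fault"]


# A clean network with 20-second MQ replies points at MQ, not the network.
print(triage(Observation(tcp_rtt_ms=5.0, retransmit_pct=0.1,
                         mq_response_ms=20000.0, mq_error_pct=0.0)))
```

The point of returning a list rather than a single verdict is that both conditions can hold at once, in which case both teams have something concrete to look at.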
Now, such tools did not exist until recently. But they do exist now. And it behooves any MQ shop to consider them.
What Do You Think?
- For which key applications does your shop leverage MQ?
- Do you have any other MQ challenges or war stories to add?
- What processes or methods have you been able to use to solve MQ issues?