Stress and Frustration at Work
10-18-2012 at 10:10 PM
I hate to write an angry blog. I have so many ideas for positive blogs just waiting to be written (Walt Whitman’s poetry, the great baseball season the Baltimore Orioles had this year, coping with Matthew’s terrible twos, etc.), but time and laziness have prevented me from writing them. But venting is an incredible motivator, and so this one gets written. Be aware this is a work-related blog, so names have been changed and details are avoided for internet anonymity. I'm not sure anyone here is going to care about this one. It's long and tedious.
Six months ago, the week before an upper management review of my project, one of the principal sub elements of the system we’re developing had a major failure with catastrophic breakages. To give a little background, the system we’re developing is made up of three principal sub elements, and when you integrate it all together it will be a unique system. Actually, all three of the sub elements are fairly unique in themselves. That’s about as specific as I want to get. The broken sub element was our only prototype, and if we had wanted to build another from scratch it probably would have taken a year and a lot of money. But it was repairable. We just had to make sure that we understood whatever caused it to break and addressed it in the design. I briefed the incident to my management, and they gave me the go-ahead to figure out what went wrong and to redesign and rebuild it. I came up with a nice program plan change where we could progress on the other two sub elements while the broken sub element was rebuilt. And if everything worked out as planned, meaning the broken sub element was functional again in six months, I would actually catch back up to the original schedule, and other than the cost of redesigning and refurbishing there would be no major impact.
It took almost two months to do a root cause failure analysis, which identified a rather minor design change, so minor that we were kicking ourselves for not having it in the original design. Four months after that, the broken parts had been replaced with newly fabricated parts, assembled, and readied for test. That was last week, right on schedule, and coincidentally another management review was set for the very next week.
And I was preparing for the management review all last week. It normally takes me a solid week to put the brief together, and that’s even with taking work home. At the beginning of the week I felt a sinus infection coming on, and it kept getting worse, so that by Wednesday morning I had this sharp splitting pain in my head behind my sinuses; it felt like my head was going to explode. I had never had sinus pain like that before. There was no way I could go to work. I needed to get to a doctor right away. So I took the day off and got to the doctor for an antibiotic prescription, knowing full well I would be working over the weekend to get my brief done. I sent an email out from my BlackBerry asking how the test of that sub element went. I never received a reply. Thursday morning I got back to work (the antibiotic worked remarkably fast) and played the messages on my phone. The sub element under test the previous night had had another catastrophic failure, in much the same way it did six months prior. I just tossed the pen I was writing with across my desk and against the wall.
So not only was I behind on my brief, but I had to rework parts of it to address what had just happened, and without really understanding the cost, schedule, and technical implications of this second failure I had to re-strategize to at least project some level of confidence that this is still a viable program. With poise, honesty, and even humility I gave a pretty decent presentation. Though there were some unhappy faces, I did not get beaten to a rhetorical pulp, as can happen in these briefings. The most critical statement came from one manager who said he wanted to know what I was going to do differently this time to get to the cause. Fair enough. I certainly didn’t intend to go down the very same path as before.
And the implications of this second failure were graver. That simple fix after the first breakage took six months to get it functional again…or rather functional enough to re-fail. A major redesign could mean a year, and now the other two sub elements have gone as far as they can without integrating with the third. There just isn’t much for the people on the other two sub elements to do until the broken sub element comes back online. I can’t pay people to sit around doing nothing for six months to a year and wait for that third element. I’m probably going to lose those people to other programs, and there’s no guarantee I’ll get them back when I need them. Or they may just be let go, given the way the economy is, and it’s not getting better. And finally, if it does take a year to refurbish the broken sub element, then the timing of this completed system will not meet the customer’s needs. A year’s schedule slip could mean the program’s termination.
So you can see the stress that we are under.
Right after the management brief, we held a team meeting to plan a go-forward strategy. The team members who did the root cause analysis six months ago were in denial. They really could not believe what had just happened. Embarrassed, yes, but beyond that, stunned. The design change either had no effect or, if anything, actually made things worse. Let me bring in George (name changed). He’s the expert on this sub element, one of the original designers. He’s got a very strong personality and gets incredibly prickly if he or his work is criticized, and because of his personality and the status of his expertise, he dominates his sub team, so that his team members are overly deferential. He was particularly embarrassed too.
And he ought to be. Six months ago, after the first failure, I questioned the way the sub element had been fixtured. It was fixtured differently from the original setup, where vibration had been keeping it from functioning properly. So George designed a super stiff fixture, and for the first couple of runs it worked. I didn’t particularly like it before we tested it (it just seemed to be too extreme a solution), but it seemed to work. Well, shortly after that it failed the first time, and as we were brainstorming I mentioned the stiff fixture. I wasn’t an expert on that technology, but when I actually did design work back in my younger days, I was pretty decent at stress analysis. When you increase the stiffness, the loads on parts increase because they don’t dissipate elsewhere. If there is a part that is borderline to failure, the increased load could push it over to failure. That’s basic stuff. George disagreed. He insisted that the fixture was decoupled from the load transfer (in retrospect, how could that be? It can’t.), and, when I pressed his team on this, he said in his prickly manner that in his “engineering judgment,” built from twenty-something years designing this technology, it wasn’t the issue. However, to satisfy me they would include it in their fault tree possibilities.
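For anyone who wants the stiffness point in numbers, here is a rough sketch in Python. It assumes a toy model that has nothing to do with the real hardware: the fixture and the part are treated as two springs in series, and the vibration imposes a fixed displacement across the pair. Every name and number in it is made up for illustration.

```python
def part_load(fixture_stiffness, part_stiffness, displacement):
    """Force carried through two springs in series under an imposed displacement.

    Toy model only: stiffness in N/mm, displacement in mm, force in N.
    """
    series_stiffness = (fixture_stiffness * part_stiffness) / (
        fixture_stiffness + part_stiffness
    )
    return series_stiffness * displacement


# Hypothetical values, chosen only to show the trend.
PART_STIFFNESS = 1000.0   # N/mm
DISPLACEMENT = 0.5        # mm of imposed vibration amplitude

compliant = part_load(500.0, PART_STIFFNESS, DISPLACEMENT)
stiff = part_load(50000.0, PART_STIFFNESS, DISPLACEMENT)

print(f"compliant fixture: {compliant:.1f} N")
print(f"stiff fixture:     {stiff:.1f} N")
```

In this simplified picture, a compliant fixture absorbs part of the motion, so the part sees less force. As the fixture gets stiffer, the force on the part climbs toward the full `PART_STIFFNESS * DISPLACEMENT`. A real structure is far more complicated, but the direction of the effect is the point.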
Ok, a couple of months later they completed their root cause analysis, which I thought was fairly thorough, and they identified a single part that was improperly designed. The increased stiffness issue was deemed a possible contributor to the failure, but only because the incorrectly designed part couldn’t accommodate the new loading. A single part needed redesign, with no changes to anything else. Alright, that seemed to jibe. So off we went to manufacture replacements for the broken parts along with the new part.
But after that analysis we had to address our risk mitigation plan. A risk mitigation plan (some people call it a risk register) documents all the technical and programmatic risks that might be encountered, each assessed and scored (usually low, medium, or high). For the risks rated high, mitigation plans are put into place, along with a backup plan in case the risk is actualized. I called George into my office to come up with the new risk evaluation for the broken sub element. Risk mitigation planning is something I take very seriously, and I put great effort into it. People don’t realize how important understanding risk is to a project. As you can see, a failure can alter the best laid plans. Wikipedia has a good write-up on risk management (http://en.wikipedia.org/wiki/Risk_management) if anyone is interested.
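To make the idea concrete, here is a minimal sketch of a risk register in Python. It is not the format we actually use; the class names, fields, and the two example entries are hypothetical, and the only rule it encodes is the one described above: a high-rated risk needs both a mitigation plan and a backup plan.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rating(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    description: str
    rating: Rating
    mitigation: str = ""    # what we do to lower the risk
    backup_plan: str = ""   # what we do if the risk is actualized

    def is_missing_plans(self):
        """High-rated risks must carry both a mitigation and a backup plan."""
        return self.rating == Rating.HIGH and not (
            self.mitigation and self.backup_plan
        )


# Hypothetical entries, for illustration only.
register = [
    Risk("hypothetical: key supplier delivers late", Rating.LOW),
    Risk("hypothetical: prototype part fails under test", Rating.HIGH),
]

open_items = [risk.description for risk in register if risk.is_missing_plans()]
print(open_items)
```

The useful part is the last two lines: anything rated high that still lacks a plan shows up as an open item before the review, rather than during it.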
So sitting in my office we get to “robustness of parts insufficient” for his sub element. Prior to the failure it had been rated a moderate risk, given it had worked a few times, though with vibration. Without even a thought that it would be controversial, I said that this had to be a high risk until proven otherwise. Well, it was controversial. George didn’t like that at all. “What,” I exclaimed, “you want me to not raise a red flag after this had just broken?” “Yeah,” he insisted. And here was his logic. The sub element had gone through design and was working for a while. We failed because of a single part. We analyzed it to death, so that now we completely understood it, and so it should stay at moderate, if not actually be reduced to low.
“What!? You think because you analyzed it you have reduced the risk of something we know has already failed?”
“Yes. It was a moderate risk before the failure and now it should at most be a moderate risk again.” This went back and forth for a while, each time our voices getting louder to emphasize our points.
“No f’n way am I going to stand up in front of management and claim we have actually reduced risk after we just broke the damn thing and it’s going to cost hundreds of thousands of dollars to refurbish.”
“You’re wrong.” And he gave me this reasoning. “What’s the risk of being hit by lightning? Low, right? Let’s say you get hit by lightning and you survive. What’s the risk of being hit by lightning again? The same. It hasn’t changed. Same thing for the sub element. It was rated moderate; it broke; we fixed it; now it’s the same risk as before.”
I was flabbergasted. My Systems Engineering Lead, who was also in the room, actually started nodding in agreement with him. Well, he should have known better, but I think he was more intimidated by George’s elevated voice than really using his brain.
“No, it doesn’t work that way,” I yelled back. Now, on the spot in the middle of the argument, I couldn’t articulate the technical flaw in his reasoning, but I knew he was wrong. Here’s why he was wrong; I was able to think straight later, after the meeting, without the adrenaline flowing. Being hit by lightning is a known statistical risk because enough events have been tabulated. This blog (http://www.stumblerz.com/chances-of-...-by-lightning/) claims it’s 1 in 280,000, which is a lot higher than I would have guessed, but it can be approximated by statisticians. George is right that it doesn’t change if you’ve been hit once. It’s the same risk. Here’s the difference. The sub element that broke has no statistical history associated with it, or not enough. When we say it’s moderate or high, it’s a judgment call based on very limited history. All I know is that one out of the five times we actually operated the damn thing, it broke. That’s high in my book. But there aren’t enough statistically valid events to base a prediction on. As a contrast, and something you might identify with, take the water pump in your car. The car company has made millions of them over the years and has built up a statistical database from testing and from collecting car histories. They know (I’m making these numbers up for illustration) that there is a 95% probability it will have broken by 80 months of regular car usage and a 5% chance it will have broken by 36 months. So you could say you have a low risk at 36 months and a high risk at 80 months. George’s sub element doesn’t have any history from which to know a statistical probability. So it’s a judgment, and given that it has already failed once and has only been operational a few times, one has no choice but to say it’s a high risk. And thank goodness I left it at that.
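Here is a small Python sketch of the same point, for the statistically inclined. It uses the Wilson score interval, which is my choice for illustration and not something we used on the program, to show how little one failure in five runs tells you compared with a large history. The fleet numbers are made up, just like the water pump numbers above.

```python
import math


def wilson_interval(failures, trials, z=1.96):
    """Approximate 95% Wilson score interval for a failure probability."""
    p = failures / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# One failure in five runs: all the history the prototype has.
prototype = wilson_interval(1, 5)

# Hypothetical fleet data: 50,000 failures in 1,000,000 units.
fleet = wilson_interval(50_000, 1_000_000)

print(f"prototype: {prototype[0]:.1%} to {prototype[1]:.1%}")
print(f"fleet:     {fleet[0]:.2%} to {fleet[1]:.2%}")
```

With five runs, the plausible failure rate spans roughly 4% to 62%, so the data cannot rule out a very high rate. With a million units, the estimate is pinned down to a fraction of a percent. That is the difference between a tabulated risk like lightning and a judgment call on a one-off prototype.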
George doesn’t know what he’s talking about. After he saw I wouldn’t back down he said, “ok, you’re the program manager, you can do what you want. I just don’t understand risk.” And with that he left the office. Now George’s elevated voice was not a result of anger. It was just an increase in volume to emphasize his points, and because he was trying to impose his will through that volume, I reflexively raised my voice to counter. There was no animosity. I was surprised later at how loud we must have gotten. The system engineer that was in the room was completely taken aback. He would have folded to George’s voice. People from outside the office stopped me to ask what happened since they heard the shouting. It was a very memorable moment.
And I’m sure George, sitting in the room with the entire team strategizing after the second failure, was remembering both the stiffness discussion and how wrong he had been on the risk. How could he not? And after the team consensus was that the increased stiffness had to have played a part in both breakages, I was burning inside. F’n sh*t.
And it doesn’t stop there. Here’s George’s recommendation. Since we’re under a time constraint, let’s rebuild it again and run it at a lower speed (lower forces on the parts) and accept this capability with the understanding that we’ll have time to redesign it for the customer later on.
What? Build it again without verifying what went wrong? How do you know what speed it can operate at? Even if I got the ok to do that, which I doubt I could, can you imagine standing in front of management with a third f’n breakage?
No. We are going to understand exactly what happened and re-perform the root cause failure analysis. This time I am adding someone from outside the current team to manage the effort, plus another stress analyst, one I have complete confidence in, to be part of the failure team. We need fresh eyes looking at the problem, and I don’t care what the so-called experts on this technology seem to know.
And that’s if we even have a program. The day after the management brief I got a call from what I’ll call the “chief scientist” of the company, who wants to meet to understand the technical fundamentals of the system in meeting its requirements. No problem there. I have high confidence in our modeling and system approach. But in the five years I’ve been running the program I was never questioned before. And then the day after that, the assistant to the vice president of R&D called to set up a meeting (Friday morning, 8 AM) with several members of upper management to understand the viability of the program. Yeah, that doesn’t sound good. I feel like I’m a character in a mafia movie being asked to meet with the Godfather, where the likelihood is I’m going to get whacked as I step into the room. That may be a closing joke, but I’m not a happy camper.



