Wednesday, September 5, 2007

Engineering failures - or 20/20 hindsight?

A few minutes ago I read an email saying Palm is withdrawing the Foleo platform at the eleventh hour - http://blog.palm.com/palm/2007/09/a-message-to-pa.html.

Yesterday, I heard a VOIP BlueBox podcast (http://www.blueboxpodcast.com/) arguing the relative merits of the SIP protocol from an engineering and design perspective. One point stood out: security considerations seem to have been added much later, after the protocol design was essentially complete and the first set of users was already using it in the public user space.


A few weeks ago, newspapers and other media were wringing their hands over the ghastly bridge collapse in Minnesota (http://en.wikipedia.org/wiki/I-35W_Mississippi_River_bridge). Every media report was quick to focus on the seeming 'design failures', and money was quickly sanctioned across the country to 'inspect' the rest of the existing bridges.


Two years ago, after Hurricanes Katrina and Rita wreaked havoc, more studies focused on the engineering failures there.


But is all of this truly engineering failure? Are we looking at the original specifications for the designs in consideration? TCP/IP and the associated network protocols worked perfectly for their original design – fault-tolerant robust network connectivity to share information between peer universities. The security problem surfaced after commercial interests worked to expand the original network into the Internet of today, without adapting the original protocol for their proposed use and/or testing it for the proposed set of uses.


The same is true for the bridge collapse and the hurricane stories - the engineers did their work and highlighted the limits of their design. However, other interests kicked in and signed off on unknown risks without complete information, and the result is the slurry pools we see today :) So I ask myself the question - should we be blaming the engineers for poor design?


Then again, testing does not necessarily highlight every issue, as the Skype issue (http://blog.tmcnet.com/blog/tom-keating/skype/skype-offline-latest-update.) proves. The protocol seemed to work fine - till it reached the perfect tipping point: software updates, a P2P mesh that was never tested at this volume, and a network with unprecedented global usage. (I too would love a lab that could test 20 million simultaneous online users and help me prepare for all eventualities - but is it fair to ask any commercial organization to set one up?) So who do we blame this on? Skype - (who else?) for providing a service that costs - for basic usage - nothing at all except the cost of an internet connection :)


The current environment seems to focus on finding someone to blame for every failure - irrespective of whether the failure is a valid one, or whether the use that caused it was ever valid for the design. Engineers need to step up to the plate in their own defense. They need to shake off their reticence about public speaking, and document their engineered specifications better. And others need to keep their use patterns within the engineered specifications - or ask for engineering modifications to fit the new proposed usage. In the absence of this rigor, watch out for more failures in similar patterns! 20/20 hindsight is always correct - how about moving that correctness to before the failure rather than armchair pontification afterwards?
