Chaos Testing Your Exception Handling

Transient exception handling and retry logic are considered important defensive programming practices, especially in the public cloud. But how good is your exception handling? Unfortunately, transient exceptions are not always easy to simulate, so this code often goes untested.

Consider the Azure Redis Service, for example. It does not provide a way to simulate failures, so we decided to create our own Chaos Redis library. Fortunately, Microsoft has developed a Windows port of Redis Cache.

We decided to modify the code so we can inject chaos.

You can get our changes by:

1. Adding a remote that points to the following repository:
git remote add chaosChanges https://github.com/lavbox/redis.git

2. Fetching the above changes into your repository:
git fetch chaosChanges

3. Ensuring you’re on your master branch, then merging these changes:
git merge chaosChanges/master

In a nutshell, we defined the following three configuration settings to simulate chaos…

ErrorProbability: This setting specifies the probability of an error occurring, as a percentage. A value of 25 means that, on average, one request in four results in an error. The value should be between 0 and 100.

FaultDownTime: This setting specifies how long the server should stay in a faulty state. The value should be in seconds. A value of 60 means that the server will be in a faulty state for one minute after a random error is introduced.

MinimumWaitTimeBetweenFaults: This setting specifies the minimum time between two faults. The value should be in seconds. A value of 300 means that the server will wait at least five minutes after a faulty period before introducing another random error.

The Redis server reads the above settings from the configuration file (chaos.conf) and injects failures accordingly.

The following flowchart depicts how the fault/chaos is injected based on the parameters described above.
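The decision logic implied by the three settings can also be sketched in code. The C# below is a minimal model under stated assumptions, not the code in the modified server (which is written in C): the class name ChaosInjector, the method ShouldFail, the injectable random source, and the exact order of the checks are all hypothetical and chosen only to match the descriptions of the settings above.

```csharp
using System;

// Hypothetical model of the three chaos settings. It illustrates one plausible
// reading of the setting descriptions; it is not taken from the server code.
public sealed class ChaosInjector
{
    private readonly double _errorProbabilityPercent; // ErrorProbability (0-100)
    private readonly TimeSpan _faultDownTime;         // FaultDownTime
    private readonly TimeSpan _minWaitBetweenFaults;  // MinimumWaitTimeBetweenFaults
    private readonly Func<double> _nextPercent;       // random source, returns [0, 100)
    private DateTime? _faultStart;                    // start of the latest fault, if any

    public ChaosInjector(double errorProbabilityPercent, TimeSpan faultDownTime,
                         TimeSpan minWaitBetweenFaults, Func<double> nextPercent)
    {
        if (errorProbabilityPercent < 0 || errorProbabilityPercent > 100)
            throw new ArgumentOutOfRangeException(nameof(errorProbabilityPercent));
        if (nextPercent == null)
            throw new ArgumentNullException(nameof(nextPercent));

        _errorProbabilityPercent = errorProbabilityPercent;
        _faultDownTime = faultDownTime;
        _minWaitBetweenFaults = minWaitBetweenFaults;
        _nextPercent = nextPercent;
    }

    // Returns true when a request arriving at 'now' should fail.
    public bool ShouldFail(DateTime now)
    {
        if (_faultStart.HasValue)
        {
            DateTime faultEnd = _faultStart.Value + _faultDownTime;

            // Still inside the faulty period: every request fails.
            if (now < faultEnd)
                return true;

            // Cool-down after a fault: no new fault may start yet.
            if (now < faultEnd + _minWaitBetweenFaults)
                return false;
        }

        // Healthy and eligible: roll the dice for a new fault.
        if (_nextPercent() < _errorProbabilityPercent)
        {
            _faultStart = now;
            return true;
        }

        return false;
    }
}
```

With the example values from the settings above, this would be constructed as new ChaosInjector(25, TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(300), () => rng.NextDouble() * 100), where rng is a shared System.Random instance. Passing the random source in as a delegate keeps the model deterministic under test.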

Once you set these values, you can start your Redis Cache instance and then run your client code with transient handling built right in…and see exactly how well your code handles chaos scenarios.

Here is a piece of client code that we tested against.

using System;
using System.Diagnostics;
using System.Threading;
using StackExchange.Redis;

namespace BasicTest
{
    class Program
    {
        static void Main(string[] args)
        {
            int AsyncOpsQty = 500;
            int retryWait = 5;

            if (args.Length == 1)
            {
                int tmp;
                if (int.TryParse(args[0], out tmp))
                    AsyncOpsQty = tmp;
            }

            Console.WriteLine("Working ...");

            MassiveBulkOpsAsync(AsyncOpsQty, retryWait);
  
        }
        static void MassiveBulkOpsAsync(int AsyncOpsQty, int retryWait)
        {
            using (var muxer = ConnectionMultiplexer.Connect("127.0.0.1:6379,abortConnect=false,connectTimeout=50000,syncTimeout=50000,connectRetry=20"))
            {
                RedisKey key = "MBOA";
                var conn = muxer.GetDatabase();

                var watch = Stopwatch.StartNew();
                int errorCountSet = 0;
                // Run exactly AsyncOpsQty successful Set operations.
                for (int i = 0; i < AsyncOpsQty; i++)
                {
                    try
                    {
                        muxer.Wait(conn.StringSetAsync(key, i));
                    }
                    catch (Exception)
                    {
                        // Back off, then repeat the same iteration.
                        Thread.Sleep(TimeSpan.FromSeconds(retryWait));
                        i--;
                        errorCountSet++;
                    }
                }
                int errorCountGet = 0;
                // Read the value back, retrying until the Get succeeds.
                while (true)
                {
                    try
                    {
                        int val = (int)muxer.Wait(conn.StringGetAsync(key));
                        break;
                    }
                    catch (Exception)
                    {
                        Thread.Sleep(TimeSpan.FromSeconds(retryWait));
                        errorCountGet++;
                    }
                }
                watch.Stop();
                Console.WriteLine("Total time taken: {0} seconds", watch.Elapsed.TotalSeconds);
                Console.WriteLine("\tNumber of Set operations: {0}", AsyncOpsQty);
                Console.WriteLine("\tError count in Set operations: {0}", errorCountSet);
                Console.WriteLine("\tError count in Get operation: {0}", errorCountGet);
                Console.WriteLine("Press any key to exit ...");
                Console.ReadKey();
            }
        }
    }
}
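The test client above retries forever with a fixed wait, which is fine for measuring how many faults were injected. In production code you would typically bound the number of attempts and back off between them. The helper below is a minimal sketch of that idea, not part of the original test: the names Retry and Execute, the doubling of the delay, and the optional sleep parameter (there only so the wait can be replaced in tests) are all assumptions made for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical retry helper: bounded attempts with exponential backoff.
public static class Retry
{
    public static T Execute<T>(Func<T> operation, int maxAttempts,
                               TimeSpan initialDelay, Action<TimeSpan> sleep = null)
    {
        if (operation == null)
            throw new ArgumentNullException(nameof(operation));
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        if (sleep == null)
            sleep = Thread.Sleep; // real wait unless a test substitutes its own

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            // Catching Exception keeps the sketch short; real code should catch
            // only the exception types it considers transient.
            // On the final attempt the filter is false, so the exception propagates.
            catch (Exception) when (attempt < maxAttempts)
            {
                sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait
            }
        }
    }
}
```

Against the client above, a read could then look like int val = Retry.Execute(() => (int)conn.StringGet(key), 5, TimeSpan.FromSeconds(1)); which gives up after five attempts instead of looping indefinitely. Narrowing the catch to transient failures, for example StackExchange.Redis's RedisConnectionException, avoids retrying on genuine bugs.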

About Vishwas Lele

Vishwas Lele serves as Chief Technology Officer at Applied Information Sciences, Inc. Mr. Lele is responsible for assisting organizations in envisioning, designing, and implementing enterprise solutions. Mr. Lele brings close to 24 years of experience and thought leadership to his position, and has been at AIS for 18 years. A noted industry speaker and author, Mr. Lele serves as Microsoft Regional Director for the Washington, D.C. area and is a member of Windows Azure Insiders group. Additionally, Mr. Lele received an MVP (Most Valuable Professional) for Solution Architecture.
