Recently while debugging some .NET code I encountered an interesting oddity - while processing some data from a text file exported from a SQL Server database, I was getting negative values for fields that shouldn’t have been negative. The data in the database all came from one source, and in this source the particular fields were defined as unsigned integers.
So where could the negative values have come from?
As mentioned earlier, the DBMS in question is SQL Server, which doesn’t have an unsigned integer data type. This table is defined as follows:
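The exact schema isn't reproduced here; a minimal sketch of the relevant shape, with hypothetical table and column names, might look like this:

```sql
-- Hypothetical reconstruction: the source field is an unsigned 32-bit integer,
-- but SQL Server has no unsigned type, so INT was used.
CREATE TABLE dbo.Measurements (
    Id    INT IDENTITY PRIMARY KEY,
    Value INT NOT NULL
);
```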
It might be obvious already, but it took a moment before it hit me - the culprit is integer overflows.
Overflows, of the integer kind, are not exactly a mysterious thing. If you try to assign a value that is out of range of a variable's data type, you get a result that is not what mathematics says it should be. Say you add 4321 to 30000 and try to assign the result to a 16-bit integer: instead of 34321, you get -31215. Again, this is not very surprising, and is sometimes even desired.
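That example can be reproduced in C# like this (the unchecked keyword is needed because the compiler would otherwise reject the constant expression):

```csharp
// 34321 does not fit into 16 bits; the value wraps around: 34321 - 65536 = -31215
short s = unchecked((short)(30000 + 4321));
Console.WriteLine(s);  // -31215
```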
One of my favourite books on .NET, CLR via C# by J. Richter (a book I highly recommend), devotes a section to it, warns to be wary and gives general guidelines on how to keep yourself out of trouble, but naturally doesn’t go into detail on every possible case.
For many people, myself included, overflows are not something you have to think about every day. Also, .NET is a platform that places much emphasis on safety. And, I must admit, .NET has spoiled me with all kinds of safety features, for example array bounds checking - I have a feeling that .NET will protect me from such errors.
And the fact is that the C# compiler does provide some checks for some cases of unintended overflow.
Let’s look at some examples:
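A few illustrative cases (the error codes are what the C# compiler reports for overflowing constant expressions):

```csharp
int a = int.MaxValue + 1;    // compile error CS0220: operation overflows at compile time
int b = (int)uint.MaxValue;  // compile error CS0221: constant value cannot be converted to 'int'

uint u = uint.MaxValue;
int c = (int)u;              // compiles without complaint; c == -1 at run time
```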
It seems that the C# compiler will detect an overflow and output a compile error whenever the expression can be evaluated at compile time. However, as soon as a variable is involved in the expression, no errors are generated.
You might notice that even in the cases when values overflow, the user still has to be explicit and provide an explicit cast. So how can this go unnoticed? Take the following example.
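The offending line looked roughly like this (the Insert method and its signature are illustrative, since the real one is generated):

```csharp
uint sourceValue = 3_000_000_000;  // a perfectly valid unsigned value from the source system

// Insert takes an int? parameter; the explicit cast makes it compile,
// but the value silently wraps to -1294967296.
table.Insert((int?)sourceValue);
```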
In my case .Insert(…) is a generated method, and it gets generated from the SQL Server types. Without the cast to int? the code would not compile. Apparently, the programmer who wrote this assumed the conversion to be safe, as uint "can hold only non-negative values", while int "can hold negative as well". In the end it comes down to programmer error.
In our case this problem had existed and gone unnoticed for several years, and the database contained several hundred million records with this problem.
Luckily, no data was ever lost - a signed integer has as many bits as its unsigned counterpart, so the values can be reinterpreted/converted back to the original values.
If you need to do this in C#, by using the unchecked keyword the conversion can be as simple as this:
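A sketch of the round trip (the variable names are mine):

```csharp
int stored = unchecked((int)3_000_000_000);  // what ended up in the database: -1294967296
uint original = unchecked((uint)stored);     // reinterpreted back to 3000000000
```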
If you’re in SQL Server, this can be done by a temporary conversion to BIGINT and some modular arithmetic (or possibly some other clever trick).
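For example, something along these lines (table and column names hypothetical):

```sql
-- 4294967296 = 2^32; adding it and taking the modulus maps negative INT values
-- back to their original unsigned interpretation, and leaves positive values unchanged.
SELECT (CAST(Value AS BIGINT) + 4294967296) % 4294967296 AS OriginalValue
FROM dbo.Measurements;
```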
Of course, one might say that the correct way to deal with this is to change the column data type to BIGINT and update all code. But if you have a non-trivial amount of data, say several hundred million records, and an active database, this may take a while. In addition, the savings in space may matter: INT is the smallest data type that can hold these 32-bit values, so it is also the most efficient to store. In our case, we decided to leave the data in the INT columns and change the data retrieval logic to convert the data dynamically as needed.
Ways to protect yourself
There are a couple of ways to catch such errors early.
One way - be disciplined and explicitly use the checked and unchecked keywords wherever an overflow might occur. In addition to making the program safer, the intent becomes clearer. But this may be impractical in existing code bases.
Another way - use the /checked compiler option - this makes checked arithmetic the default everywhere outside of explicit checked and unchecked blocks.
One thing to keep in mind is that checked integer arithmetic may negatively impact performance. In CLR via C#, J. Richter suggests turning on checked mode by default in Debug builds and leaving it off for Release builds. That way you get the best of both worlds - catch errors more easily while debugging and testing, and run at full speed in production.
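In project files, the same effect can be achieved with the CheckForOverflowUnderflow MSBuild property, which corresponds to the /checked compiler option; a fragment like this (assuming the standard Debug configuration name) enables it for Debug builds only:

```xml
<PropertyGroup Condition="'$(Configuration)' == 'Debug'">
  <CheckForOverflowUnderflow>true</CheckForOverflowUnderflow>
</PropertyGroup>
```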