Title: Improve the program that displays COVID-19 data for US states in C#
This post describes two improvements to the earlier example that displays COVID-19 data for US states. The first change is a basic software engineering improvement. The second displays daily changes in COVID-19 case, hospitalization, and death numbers.
Software Engineering Notes
While I was working on the second change to this program, I noticed that some of the data made no sense. The program showed that some states were getting tens of thousands of new COVID-19 deaths in a single day. That was obviously wrong. After a little digging, I discovered that the website where the program was downloading its data had changed the data's format. It had inserted an extra column in the middle of the CSV data, so the columns that came after that one were in the wrong positions.
This raises two important software engineering issues: program lifespan and magic numbers.
Never assume that a program, even a trivial one, won't still be in use long after you expected it to be retired. In this case, I intended my original COVID-19 posts to be simple programs that you could use to examine the available data. I didn't think they were very important programs, so I was a bit lazy about things like using magic numbers. I knew which columns contained the data, so I just plugged the column numbers into the program.
If you think this is an isolated problem, you're most definitely wrong. Remember the whole Y2K panic? That happened because engineers (they didn't have programmers back then) back in the 60s assumed that their software wouldn't still be running 40 years later. I've also written quick throw-away programs that were used years after I thought they would be discarded.
If a program does something useful, people will continue to use it. They could write a newer version, but that would take work so why bother?
The point of this is, don't be lazy. If you expect a program to be useful for only a short period or only for your use, you don't need to give it the full assortment of error handling, logging, printing, reporting, and other features that you would include in a commercial application, but don't be lazy. Take a little extra time to make the code as robust as you reasonably can.
In the original example, I used the following constants to indicate the columns that contain different pieces of COVID-19 data.
const int colDate = 1;
const int colState = 2;
const int colPositive = 3;
const int colNegative = 4;
const int colPending = 5;
const int colHospitalizedNow = 6;
const int colHospitalizedTotal = 7;
const int colIcuNow = 8;
const int colIcuTotal = 9;
const int colVentNow = 10;
const int colVentTotal = 11;
const int colRecovered = 12;
const int colDeaths = 17;
Later the code can refer to a column by name, as in colNegative rather than 4. That's an excellent first step, but I should have taken things one step further. Rather than hard-coding the column numbers into the program, I should have made the program find the column numbers at runtime.
This brings up another important piece of software engineering advice. If you do not have total control over the sources of your data, you cannot be sure that the data format won't change over time. In this example, there was no reason to think that the data format would change. Why should it? Well, apparently the data provider thought there was a good reason.
Along with the realization that the data can change is the secondary idea that there's no reason why it cannot change again. Fool me once, shame on you. Fool me twice, shame on me!
The new program uses the following FindColumn method to find the columns for the various pieces of COVID-19 data.
// See which column contains the indicated column header.
private int FindColumn(object[,] fields, string header)
{
    // Search the header row (row 1) for a case-insensitive match.
    for (int i = fields.GetLowerBound(1);
        i <= fields.GetUpperBound(1); i++)
    {
        if (fields[1, i].ToString().ToLower() == header.ToLower())
            return i;
    }
    throw new Exception("Cannot find column " + header);
}
This method loops through the column headers in the CSV file's first row until it finds the desired column header. It then returns the column number.
Now the program can handle it if the data provider inserts or removes columns.
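To make the idea concrete, here is a minimal sketch of how the lookup behaves when a column is inserted. It is not the program's actual code. The class name ColumnFinderDemo, the helper MakeHeaderRow, and the header names used with it are made up for illustration, and the sketch assumes the array's first row has index 1, as the call fields[1, i] implies. This variant of FindColumn also skips empty cells and compares with StringComparison.OrdinalIgnoreCase instead of ToLower.

```csharp
using System;

public static class ColumnFinderDemo
{
    // A static variant of FindColumn: return the index of the column
    // whose header (in row 1) matches, ignoring case.
    public static int FindColumn(object[,] fields, string header)
    {
        for (int i = fields.GetLowerBound(1);
            i <= fields.GetUpperBound(1); i++)
        {
            object cell = fields[1, i];
            if (cell != null && string.Equals(cell.ToString(), header,
                StringComparison.OrdinalIgnoreCase))
                return i;
        }
        throw new Exception("Cannot find column " + header);
    }

    // Hypothetical helper: build a 1-based array that holds only a
    // header row, so the lookup can be tried without a real data file.
    public static object[,] MakeHeaderRow(params string[] headers)
    {
        var fields = (object[,])Array.CreateInstance(
            typeof(object),
            new[] { 1, headers.Length },   // 1 row, N columns
            new[] { 1, 1 });               // both dimensions start at 1
        for (int i = 0; i < headers.Length; i++)
            fields[1, i + 1] = headers[i];
        return fields;
    }
}
```

If you build one header row with MakeHeaderRow("date", "state", "death") and another with an extra header inserted before "death", FindColumn(fields, "death") returns 3 for the first and 4 for the second, so code that uses the returned value keeps reading the right column.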
Note that it would still have problems if the provider renames one of the columns that the program uses. Hopefully that won't happen. I think that's a less common change than adding or removing a column.
Notice also that the code now throws an exception if it cannot find the desired column. The earlier version of the program used a hard-coded column number so it didn't realize that it was using the wrong data. Now if there is a problem, the program will immediately make it obvious so I can fix it.
I want to make one final note here before moving on. For similar reasons, it is generally better if you use keys, column names, and other identifying names rather than indices. For example, when you fetch data from a database, you can index the columns by name instead of number. Then if the columns are reordered, the program still works. If a column that you use is renamed, the program will fail rather than using the wrong data and producing incorrect results that may be hard to detect.
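The following sketch illustrates the database point with an in-memory DataTable rather than a real database. The table name, column names, and the value 100 are placeholders that I made up for the illustration, and NamedColumnDemo is a hypothetical class, not part of the example program.

```csharp
using System;
using System.Data;

public static class NamedColumnDemo
{
    // Build a tiny table with placeholder values.
    public static DataTable MakeTable()
    {
        var table = new DataTable("StateData");
        table.Columns.Add("State", typeof(string));
        table.Columns.Add("Deaths", typeof(int));
        table.Rows.Add("CO", 100);   // Placeholder value, not real data.
        return table;
    }

    // Look up the value by column name. This keeps working if columns
    // are added or reordered, and throws ArgumentException if the
    // column named "Deaths" no longer exists.
    public static int DeathsByName(DataTable table)
    {
        return (int)table.Rows[0]["Deaths"];
    }

    // Look up the value by position. This silently returns whatever
    // happens to be in that position.
    public static object ValueByIndex(DataTable table, int index)
    {
        return table.Rows[0][index];
    }
}
```

For example, if you add a new column and move it to position 1 with DataColumn.SetOrdinal, DeathsByName still returns the same value, while ValueByIndex(table, 1) now returns the new column's contents. If you rename the Deaths column, DeathsByName throws an exception instead of quietly returning the wrong data.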
The previous versions of this program let you visualize COVID-19 data. For example, the following picture shows the previous version displaying COVID-19 death data for the state of Colorado.
What we want to see is a flattening of this curve. That would indicate that the number of deaths is leveling off, hopefully eventually becoming horizontal indicating no new COVID-19 deaths. But it's a bit hard to see how effectively the curve is flattening.
If we look at the curve's slope, that may make it easier to see if the number of cases is declining. You could calculate that value by subtracting the previous day's number from the current day's number, but the data already includes daily increases in positive tests, hospitalizations, and deaths, so displaying those values is relatively easy. I just modified the program to display the new columns much as it displays the other columns. See my previous COVID-19 posts and download the example to see the details.
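If the data did not include the daily increases, the calculation might look something like the following sketch. The class and method names are hypothetical, and the sketch assumes that the cumulative totals are stored oldest first. The first day has no previous day to compare against, so I arbitrarily give it an increase of 0.

```csharp
using System;

public static class DailyChange
{
    // Convert cumulative totals (oldest first) into daily increases.
    // The result has the same length as the input so the values line
    // up with the same dates. The first entry is 0 because there is
    // no earlier day to subtract.
    public static int[] DailyIncreases(int[] cumulative)
    {
        var increases = new int[cumulative.Length];
        for (int i = 1; i < cumulative.Length; i++)
            increases[i] = cumulative[i] - cumulative[i - 1];
        return increases;
    }
}
```

Note that a daily increase can be negative if the provider revises an earlier total downward, so code that draws these values should not assume that they are all non-negative.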
If you look again at the picture at the top of this post, you'll see how the death data has changed daily in Colorado. You can see that the deaths reached a peak on April 25 and have been more or less decreasing since then.
The general shape of the data is a bit easier to see if you connect the peaks. It also helps to resize the program so it's relatively short and wide, although then it wouldn't fit on the web page very well. (I did slip in one other useful change to the code. If you resize the form, the program redraws the currently selected data. In the previous versions you needed to change your selections to make the program redraw.)
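The example program does not connect the peaks for you, but if you wanted to try it, a first step would be to find them. Here is a minimal sketch, assuming that a peak is simply a value that is larger than both of its neighbors. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;

public static class PeakFinder
{
    // Return the indices of the values that are strictly larger than
    // both neighbors. The first and last values are never reported
    // because they have only one neighbor.
    public static List<int> FindPeaks(int[] values)
    {
        var peaks = new List<int>();
        for (int i = 1; i < values.Length - 1; i++)
        {
            if (values[i] > values[i - 1] && values[i] > values[i + 1])
                peaks.Add(i);
        }
        return peaks;
    }
}
```

You could then draw line segments between the points at the returned indices. This simple test ignores plateaus where two adjacent days have the same value, which may or may not matter for this data.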
The main software engineering lesson to be learned from this example is, don't be lazy. Make your programs use named columns, keys, or other identifiers instead of indices when you can.
This example also lets you see the daily increase in positive tests, hospitalizations, and deaths from COVID-19 in the United States.
I haven't been able to find an explanation or even a mention of the general spikiness of the data. In Colorado, for example, the death data seems to have a spike every 3 days or so. I suspect this is a reporting artifact.
The following picture shows the combined values for the whole country.
In this picture the peaks are remarkably close to once per week, for some reason on Thursdays. This is almost surely an artifact of the way the data is gathered and reported.
Download the example to experiment with it and to see additional details.