Feeds:
Posts
Comments

Archive for the ‘Refactoring’ Category

I ran across this thoughtful post by John Sonmez showing a revision process whose goal was “refactor it to be as clean as possible.” Somehow the final result didn’t quite work for me and I’m writing this as I think about it to understand why. Here are the concerns of the original method:

Process:

  1. iterate over lines from a file
  2. split each line into columns
  3. iterate over each column
  4. log an error and stop processing if the column is invalid, otherwise
  5. Print the column

That’s a lot of concerns resulting in a potentially hard to follow implementation so the author refactored it with the goal of easier to read and therefore easier to maintain code in mind. The concerns of John’s final refactoring might be abstracted as follows:

Method1:

  1. iterate over lines from a file
  2. call Method2 on each line
  3. log an error for the first bad-column exception and stop processing

Method2:

  1. split the line into columns
  2. iterate over each column
  3. call Method3 on the each column

Method3:

  1. throw an exception if the column is invalid, otherwise
  2. Print the column

How many concerns does each method have? What should these methods be named? I came up with the following:

ProcessFileLinesAndLogFirstInvalidColumn()
SplitLinesIntoColumnsAndProcessEachColumn()
ThrowOnInvalidColumnDataOrPrintColumn()

Since the secondary methods are private their functionality can only be tested through the public method so they only exist to organize the code. Is there a better way? Lets go back to the original concerns:

  1. iterate over lines from a file
  2. split each line into columns
  3. iterate over each column
  4. log an error and stop processing if the column is invalid, otherwise
  5. Print the column

How should this method be named? Based on the overall goal of printing each column of each line in the file how about:

PrintFileColumns()

With that in mind lets examine each concern to see if it belongs here.

First is “iterate over lines from a file”. We don’t know where the lines are coming from exactly, and we shouldn’t care, so this looks good to me.

Next is “split each line into columns.” Because this code has detailed knowledge of the line structure, if the structure changes this code has to change. But splitting is not a concern of PrintFileContents() so this is a clear violation of Single Responsibility that needs to be refactored out.

Third is “iterate over each column” and that seems ok to me.

Fourth is “log an error and stop processing if the column is invalid.” As coded, this has the side effect that some columns may be printed before we encounter the invalid column. Is this desired behavior? What if printing is expensive and we get most of the way through then encounter the invalid column? We have to fix it, maybe truncate the file to this point then continue or just run them all again and pay the extra cost. And who is to say that the same thing won’t happen on the very next column or line or page? This seems flawed to me. I would think you’d want the entire list of problems so you could fix them all at once. However, if file is really an IEnumerable hooked to a web spider then the process of crawling would be expensive and we’d want to break fast. At this point I’d want to see where file comes from or talk to someone to figure out why this is coded the way it is and if we can change it. Either way logging invalid column data is not the concern of PrintFileColumns() and I would suggest that detecting invalid columns shouldn’t be either.

Lastly the method calls Print() on each column. The details of that process have already been abstracted away so this looks good.

I propose refactoring to the following:

First there would be an external class with method GetLineColumns() that knows the details of how columns are encoded into a line. Then the original method would be split as follows:

PrintFileColumns:

  1. GetFileColumns
  2. iterate over the columns
  3. if any column HasBadData stop processing, otherwise
  4. Print the column

This maintains the current behavior where we could get a partial print run before hitting a column with bad data.

GetFileColumns:

  1. iterate over lines from a file
  2. call GetLineColumns on each line
  3. return all columns from all lines

This converts all lines from the file to columns using GetLineColumns()

HasBadData:

  1. detect invalid column
  2. call LogErrorMessage for bad columns

This makes logging the concern of the invalid-column-data-detection method since it has the most information about the nature of the problem. Furthermore, I think this method should belong to a Column class instead of to this file processing class. That would allow us to encapsulate column validation and change it without having to change classes that apply the validation.

Here’s a likely implementation in C#:

	public void PrintFileColumns()
	{
		foreach(var column in GetFileColumns())
		{
			if (Column.HasBadData(column))
			{
				return;
			}
			Print(column);
		}
	}

	private IEnumerable<string> GetFileColumns()
	{
		return file.SelectMany(line => line.GetLineColumns());
	}

with column validation encapsulated in

public class Column
{
	public static bool HasBadData(string column)
	{
		bool result = false;
		if (/* detect problem */)
		{
			LogErrorMessage();
			result = true;
		}
		// detect next problem and log it, etc.
		return result;
	}
}

and line to column conversion

public static class StringExtensions
{
	public static string[] GetLineColumns(this string line)
	{
		return line.Split(',');
	}
}

Now if we determine that reading the file is not expensive then I would advocate for logging all invalid columns instead of only the first one. This would eliminate the if statement:

	public void PrintFileColumns()
	{
		foreach (string column in GetFileColumns()
			.Where(x => !Column.HasBadData(x)))
		{
			Print(column);
		}
	}

With the proper application of an Each extension method, this can become:

	public void PrintFileColumns()
	{
		GetFileColumns()
			.Where(x => !Column.HasBadData(x))
			.Each(Print);
	}

Read Full Post »