Simple progress indicators with awk

I wanted a simple way to monitor the progress of a data processing pipeline, but multithreading garbled the output of its built-in progress bars. I therefore decided to use the number of output files in each folder as an indicator of progress. In my case, the output of tree . looks like this:

.
└── steps
    ├── A01_001
    │   ├── segment_nuclei
    │   │   ├── 0000.npz
    │   │   ├── 0001.npz
    │   │   ├── ...
    │   │   └── 0019.npz
    │   ├── tile
    │   │   ├── 0000.npz
    │   │   ├── 0001.npz
    │   │   ├── ...

I can get the information I need by counting the files and, more specifically, how many file paths contain each identifier in the range A01_001 to P24_005 (these are fields of view from a microscopy experiment). This simple find command lists every file under the current folder, recursively:

find . -type f

which results in this:

./steps/A01_003/tile/0007.npz
./steps/A01_003/tile/0009.npz
./steps/A01_003/tile/0018.npz
./steps/A01_003/tile/0016.npz
...
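
The counting step on its own can be sketched with standard tools, before any bars are drawn. This is only an illustration, not the script used in this post: the function name count_fovs is made up, and it assumes a grep that supports the -o and -E options (GNU and BSD grep both do).

```shell
# count_fovs (hypothetical name): read file paths on stdin and print
# "<count> <field-of-view ID>" for every ID that appears in the paths.
count_fovs() {
    # -o prints only the matching part of each line (e.g. A01_003),
    # sort groups identical IDs together, uniq -c counts each group.
    grep -oE '[A-P][0-9]{2}_[0-9]{3}' | sort | uniq -c
}

# Demo on three hand-written paths; in practice: find . -type f | count_fovs
printf '%s\n' \
    ./steps/A01_003/tile/0007.npz \
    ./steps/A01_003/tile/0009.npz \
    ./steps/A02_001/tile/0000.npz | count_fovs
```

This gives one number per field of view, which is accurate but harder to read at a glance than a row of dots.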

We could use wc -l to get the number of files per directory, but I want a set of progress bars to get a better sense of how things change over time. For this I use awk, my Swiss Army knife for text processing, and write a short script that counts the occurrences of each field of view, sorts them, and prints each count as a row of dots. I also added a conditional so that a field of view is only shown once it has more than one file, for pipelines that save a single file before actually running the whole pipeline.

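# progress_bar.awk (needs gawk: three-argument match() and asorti() are gawk extensions)
{
    # Count one file for each path that contains a field-of-view ID such as A01_001.
    if (match($0, /([A-P][0-9]{2}_[0-9]{3})/, capture)) {
        count[capture[1]] += 1
    }
}
END {
    # Sort the field-of-view IDs; sorted[1..n] holds them in order.
    n = asorti(count, sorted)
    for (i = 1; i <= n; i++) {
        nfiles = count[sorted[i]]
        # Skip fields of view that have produced only a single file so far.
        if (nfiles > 1) {
            # Build a string of nfiles spaces, then turn every space into a dot.
            s = sprintf("%*s", nfiles, "")
            gsub(/ /, ".", s)
            print sorted[i] " " s
        }
    }
}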

Running the find command and the awk script together (find . -type f | awk -f progress_bar.awk) yields the following snapshot of the processing progress:

A01_001 ...............................................................
A01_002 ...............................................................
A01_003 ...............................................................
A01_004 ...............................................................
A01_005 ...............................................................
A02_001 ...............................................................
A02_002 ...............................................................
A02_003 .................................................
A02_004 ..........................................
A02_005 ........................................
A03_001 ..............................................
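
The script relies on gawk-only features (three-argument match() and asorti()), so it will not run under other awk implementations such as mawk or BusyBox awk. For those systems, a rough portable variant could look like the sketch below. It is not the script used above: the function name progress_bars is made up, the regular expression is spelled out without interval expressions, and the sorting is delegated to sort.

```shell
# progress_bars (hypothetical name): read file paths on stdin and print
# one dotted bar per field of view, using only POSIX awk features plus sort.
progress_bars() {
    awk '
    # match() sets RSTART and RLENGTH, so the ID can be cut out with substr().
    match($0, /[A-P][0-9][0-9]_[0-9][0-9][0-9]/) {
        count[substr($0, RSTART, RLENGTH)]++
    }
    END {
        for (key in count) {
            # Same rule as above: only show IDs with more than one file.
            if (count[key] > 1) {
                # Build a format such as "%5s" to get that many spaces,
                # then turn every space into a dot.
                bar = sprintf("%" count[key] "s", "")
                gsub(/ /, ".", bar)
                print key, bar
            }
        }
    }' | sort
}

# Usage: find . -type f | progress_bars
```

Because every row starts with the field-of-view ID, a plain sort on the output gives the same ordering that asorti() produces inside gawk.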

Thus the last thing to do is to use `watch` to automatically refresh the status:

watch -dc --interval 1 'find . -type f | awk -f progress_bar.awk | tac'

The watch flag -d highlights what changed between refreshes, and -c enables interpretation of ANSI colours; in my terminal this keeps the highlighted changes visible for longer, but YMMV. Finally, tac reverses the order of the lines so that the last ones are displayed at the top. I like to run this command in another terminal or inside a `screen` terminal multiplexer session. When the number of rows becomes too large, it may be useful to find a heuristic for removing uninformative lines.
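
One possible heuristic, if every field of view is expected to produce the same number of files, is to hide the rows whose bar is already full. The sketch below assumes exactly that; the function name hide_finished and the expected count passed to it are hypothetical and would need to be adapted to the pipeline at hand.

```shell
# hide_finished (hypothetical name): read "ID dots" rows on stdin and keep
# only the rows whose bar is shorter than the expected number of files.
hide_finished() {
    # $1 is the assumed number of files a finished field of view produces.
    awk -v expected="${1:?usage: hide_finished N}" 'length($2) < expected'
}

# Demo on two hand-written rows, assuming 5 files per field of view.
printf '%s\n' 'A01_001 .....' 'A01_002 ...' | hide_finished 5
# prints: A01_002 ...
```

Since watch runs its command in a fresh shell where this function is not defined, the same filter would have to be written inline there, for example by adding awk -v expected=60 "length(\$2) < expected" to the pipeline before tac (60 being a made-up count).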