Script

Calculate the cumulative sum of a column using DuckDB

DuckDB, the (tabular) data exploration tool I use, supports window operations. I recently discovered that it can also perform cumulative sums very efficiently. Let us generate a toy dataset where we want to calculate the sum of one column relative to the order of another one.

    CREATE OR REPLACE TABLE seed AS SELECT SETSEED(0.1); -- seeding for reproducibility, creating a table to hide output

    -- Create a mock dataset with two integer columns
    CREATE OR REPLACE TABLE my_table AS
    SELECT
        #1 AS column_1,
        CAST(FLOOR(RANDOM() * 100) AS INT) AS column_2
    FROM generate_series(1, 10); -- This generates 10 rows

    SELECT * FROM my_table;

    -- We write it to a csv for future use
    COPY my_table TO my_table.csv;

    ┌──────────┬──────────┐
    │ column_1 │ column_2 │
    │  int64   │  int32   │
    ├──────────┼──────────┤
    │        1 │       27 │
    │        2 │       45 │
    │        3 │        2 │
    │        4 │       84 │
    │        5 │       84 │
    │        6 │       26 │
    │        7 │       18 │
    │        8 │       65 │
    │        9 │       97 │
    │       10 │       11 │
    ├──────────┴──────────┤
    │ 10 rows   2 columns │
    └─────────────────────┘

If we wanted to calculate the distribution of the cumulative sum of the table, we could use the OVER clause to sum column_2 in the order defined by column_1. ...
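As a cross-check that does not depend on DuckDB, the running total that a window expression such as `SUM(column_2) OVER (ORDER BY column_1)` is expected to yield for the table above can be reproduced with a few lines of shell and awk. This is only a sketch: the CSV text below is a hand-typed copy of the table, and the header row and the `cumulative_sum` helper name are assumptions, not part of the original post.

```shell
# Hand-typed copy of the table above; the header row is an assumption.
csv_data='column_1,column_2
1,27
2,45
3,2
4,84
5,84
6,26
7,18
8,65
9,97
10,11'

# Hypothetical helper: read CSV on stdin, skip the header, sort numerically
# by column_1, then accumulate column_2 into a running total.
cumulative_sum() {
    tail -n +2 |
        sort -t, -k1,1n |
        awk -F, '{ total += $2; print $1 "," $2 "," total }'
}

printf '%s\n' "$csv_data" | cumulative_sum
```

Each output line carries column_1, column_2 and the running total, so it can be compared row by row with whatever the SQL query returns.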

Run multiple python scripts in the background

To solve a multitude of challenges I have faced when processing high-throughput microscopy data, I have developed Nahual, a tool that allows me to move data across multiple Python environments that deploy deep learning models in the background. I usually keep these models “listening” in the background for the main analysis pipeline (aliby) to send them data to process. To monitor what’s going on inside these scripts I use GNU screen, which allows me to detach from and reattach to these sessions whenever I need to. At some point I had to reboot my server and had to rerun all of these in independent screens. This rudimentary shell script did the job: ...
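The script itself is not shown in this excerpt. As a rough illustration of the general idea only, the sketch below starts one detached GNU screen session per command. The session names, the `serve_a.py`/`serve_b.py` commands, the `start_sessions` helper and the `DRY_RUN` switch are all hypothetical; with `DRY_RUN=1` (the default here) the commands are printed instead of executed.

```shell
# Hypothetical session names and commands, one "<name>|<command>" per line.
sessions='model_a|python serve_a.py
model_b|python serve_b.py'

# With DRY_RUN=1 the screen invocations are only printed, not run.
DRY_RUN="${DRY_RUN:-1}"

start_sessions() {
    printf '%s\n' "$sessions" | while IFS='|' read -r name cmd; do
        if [ "$DRY_RUN" = "1" ]; then
            printf 'screen -dmS %s sh -c "%s"\n' "$name" "$cmd"
        else
            # -d -m starts the session detached; -S gives it a name.
            screen -dmS "$name" sh -c "$cmd"
        fi
    done
}

start_sessions
```

Once sessions are running, `screen -ls` lists them and `screen -r model_a` reattaches to one; `Ctrl-a d` detaches again.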

Simple progress indicators with awk

I wanted a simple way to see the progress of a data processing pipeline, and the internal progress bar tools were messed up by threading. I thus decided to use the number of output files in each folder as an indicator of progress. In my case the output of tree . looks like this:

    .
    └── steps
        ├── A01_001
        │   ├── segment_nuclei
        │   │   ├── 0000.npz
        │   │   ├── 0001.npz
        │   │   ├── ...
        │   │   └── 0019.npz
        │   ├── tile
        │   │   ├── 0000.npz
        │   │   ├── 0001.npz
        │   │   ├── ...

I can get the info I need by counting the total number of files and the occurrences of the A01_001 -> P24_005 range (these are fields of view from a microscopy experiment). Using this simple find command we get all the files in the current folder. ...
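One way to turn that listing into per-folder counts is to split each path on `/` and tally by field of view and step. The sketch below is not the command from the post: it builds a tiny throwaway tree with the same layout (the file names and the `count_outputs` helper are made up for illustration) and counts the `.npz` files in it.

```shell
# Build a small throwaway tree that mimics the layout above (hypothetical data).
demo_dir=$(mktemp -d)
for fov in A01_001 A01_002; do
    for step in segment_nuclei tile; do
        mkdir -p "$demo_dir/steps/$fov/$step"
    done
done
touch "$demo_dir/steps/A01_001/segment_nuclei/0000.npz" \
      "$demo_dir/steps/A01_001/segment_nuclei/0001.npz" \
      "$demo_dir/steps/A01_001/tile/0000.npz" \
      "$demo_dir/steps/A01_002/tile/0000.npz"

# Hypothetical helper: paths look like ./steps/<fov>/<step>/<file>, so with
# "/" as separator $3 is the field of view and $4 is the pipeline step.
count_outputs() {
    (cd "$1" && find . -type f -name '*.npz') |
        awk -F/ '{ n[$4 " " $3]++; total++ }
                 END { for (k in n) print k, n[k]; print "total", total }' |
        sort
}

count_outputs "$demo_dir"
```

Each line reports a step, a field of view and the number of files found, followed by a grand total; rerunning it while the pipeline works gives a crude progress readout.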

Update figure numbering

I was editing some markdown and had to insert a new figure in the middle. The problem is that this document already has explicit figure numbering (e.g., “Figure 5”), so changing tens of figures by hand felt dull. I like to run small (GNU) awk scripts for this type of task.

    # update_figures.awk
    {
        # The three-argument match() is a gawk extension; num[1] holds the figure number
        if (match($0, "Figure ([0-9]+)", num)) {
            if (num[1] > after)
                gsub("Figure ([0-9]+)", "Figure " num[1] + increase_by)
        };
        print $0
    }

This changes Figure X into Figure X + increase_by for every figure number greater than the variable “after”. And we can run it as follows: ...
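The invocation is cut off in this excerpt. As a hedged illustration, here is a variant of the same idea that uses only POSIX `match()` with `RSTART`/`RLENGTH` instead of gawk's three-argument form, and that renumbers every figure reference on a line independently. The temporary script file and the `renumber` helper are hypothetical; `after` and `increase_by` are passed with `-v` and keep the meaning they have above.

```shell
# Write a portable variant of the script to a temporary file (hypothetical location).
script=$(mktemp)
cat > "$script" <<'EOF'
{
    out = ""
    rest = $0
    # Walk through every "Figure N" on the line using POSIX match().
    while (match(rest, /Figure [0-9]+/)) {
        # "Figure " is 7 characters long, so the digits start at RSTART + 7.
        n = substr(rest, RSTART + 7, RLENGTH - 7) + 0
        if (n > after) n += increase_by
        out = out substr(rest, 1, RSTART - 1) "Figure " n
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}
EOF

# Hypothetical helper: renumber <after> <increase_by>, reading text on stdin.
renumber() {
    awk -v after="$1" -v increase_by="$2" -f "$script"
}

printf 'See Figure 2 and Figure 5.\nFigure 10 follows.\n' | renumber 4 1
```

With `after=4` and `increase_by=1`, Figure 2 is left alone while Figure 5 and Figure 10 become Figure 6 and Figure 11, which is what is needed when a new Figure 5 is inserted.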