Effective data pipelining is the bedrock of a huge number of data science projects. Even the best tools can’t run effectively if the data they are working with is wrong, malformed, or incomplete.
I’ve worked on several large data pipelining projects, ranging from robust data management at huge scale to light, efficient foundations for dashboards.
3.2bn rows of Search Console data
Search Console data offers hugely powerful insight into how a company is performing in search. But pulling data directly from the interface limits you to 1,000 rows at a time (often just a fraction of the available data). What’s more, the data is automatically deleted after 16 months, so reporting over time is severely hampered.
To solve that, I created a pipeline for Aira which automatically extracts, transforms, and loads Search Console data for many clients.
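The core trick in the extract step is paging through the Search Console API, which (unlike the interface) lets you request successive chunks of rows via its startRow and rowLimit parameters. Here's a minimal, runnable sketch of that pagination loop; `fetch_page` is a stand-in for a real API call (e.g. `searchanalytics().query(...)` from google-api-python-client) so the sketch runs without credentials, and the names are illustrative rather than the production code.

```python
# Hedged sketch: paging past the interface's 1,000-row cap using the
# API-style startRow/rowLimit pattern. fetch_page() stands in for a real
# Search Console API request and serves rows from a local list instead.

ROW_LIMIT = 25_000  # the Search Console API's maximum rows per request


def fetch_page(all_rows, start_row, row_limit=ROW_LIMIT):
    """Stand-in for an API query taking startRow and rowLimit."""
    return all_rows[start_row:start_row + row_limit]


def extract_all(all_rows):
    """Keep requesting pages until one comes back short, then stop."""
    collected, start_row = [], 0
    while True:
        page = fetch_page(all_rows, start_row)
        collected.extend(page)
        if len(page) < ROW_LIMIT:
            break  # a short page means we've reached the end of the data
        start_row += ROW_LIMIT
    return collected


# Simulated dataset larger than two full pages:
rows = [{"query": f"term {i}", "clicks": i % 7} for i in range(60_001)]
assert len(extract_all(rows)) == 60_001
```

In the real pipeline each request would also carry a site URL, date range, and dimensions, but the stopping condition (a page shorter than the requested limit) is the same.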
To date, this pipeline has cheaply collected and stored more than 3.2bn rows of data. It forms the bedrock of many tools used in sales pitches and retainers worth £millions.
We’ve shared this pipeline, and the tools it enables, multiple times at Aira, including in this video about one of the tools we built on the back of the pipeline.
While there isn’t enough space to cover the logic and error handling in depth, here’s a broad diagram of the process we use:
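The broad shape of the process can also be sketched in code. The function names, row shape, and sqlite destination below are illustrative assumptions, not the production design (which includes the logic and error handling mentioned above), but they show the extract-transform-load flow end to end.

```python
# Minimal ETL sketch (assumed names, not the production pipeline):
# transform tags each raw row with its client so many clients can share
# one table; load appends the rows to a database.
import sqlite3


def transform(raw_rows, client):
    # Flatten each API-style row into a (client, query, clicks) tuple.
    return [(client, r["query"], r["clicks"]) for r in raw_rows]


def load(conn, rows):
    conn.executemany(
        "INSERT INTO search_data (client, query, clicks) VALUES (?, ?, ?)",
        rows,
    )


conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.execute(
    "CREATE TABLE search_data (client TEXT, query TEXT, clicks INTEGER)"
)

raw = [{"query": "example term", "clicks": 12}]  # would come from the API
load(conn, transform(raw, client="client-a"))
```

Running the extract step per client on a schedule and appending through a load step like this is what lets the table grow cheaply into the billions of rows.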