Cleaning up the data swamp

In a world amassing a colossal amount of data, how are we wading through the swamp?


Throughout the last two decades, we’ve operated under the same narrative about data – hoard as much of it as fast as you can. To businesses’ credit, the majority have done just that with truly spirited gusto. As a planet, we’ve gone from collecting 700 terabytes of data in the year 2000 to collecting around 79 zettabytes last year.

Naturally, all this data needs to be stored somewhere, and vendors are continuing to up their capacity. Microsoft looks to be on pace to build between 50 and 100 new data centers each year to handle our mounting requirements for the foreseeable future. Google operates, or is currently developing, 35 data centers around the world. AWS, one of the biggest spenders alongside its hyperscaler competitors – Google and Microsoft – now covers all continents except Antarctica, with over 100 data centers in 31 regions, and is promising to launch another five regions soon.

Not to be outdone, Oracle has also announced this year that it plans to build hundreds of data centers to keep up with growing demand, with at least one for every country in the world depending on regional requirements.

While vendors are all very center-happy, an unnerving realization is hitting home with businesses. At a conference hosted by Sapphire Systems earlier this year, one visual shook our ongoing strategies to the core – we are wasting the equivalent of 10.5 million Mount Everests of data each year. At current rates, the amount of wasted data will rise to 69 percent of that collected by 2025, with McKinsey estimating that data to be worth $11.3tn in undiscovered value.

The fact is, beyond being dumped somewhere, our data also needs to be handled and put to good use – or else we’re looking at a pointlessly costly act (both monetarily and environmentally) for the sake of possibility, just-in-case and best practice appearances.

So why aren’t we using it?

It’s no doubt a daunting prospect to get a handle on that sheer amount of data. Breaking things down, the problem most often stems from two key factors.

One: there has been too much data to properly manage without some serious sorting tools.

“Typically in the past, data scientists or data engineers or data practitioners were building applications, and the data sets for them had to be very small,” explains Sunny Bedi, CTO at Snowflake. “The dashboards were like reading a newspaper. Readable, but then you have to have the imagination to ask the next ‘why’ question. And it would be really hard to visualize that. So, in the past, the data scientists would have to take all that data and do these things outside of the system.”

The result was slow, often manual processing that gleaned only a fraction of what was possible, while the rest of the data filled up the lake, murky and unused. Up to 80 percent of the world’s data is unstructured.

“Some refer to them as data swamps, because a lot of times the data lake is like users were just told to get as much data in there as possible and then determine value later. But that means that you’re creating this data swamp that’s not optimized for analytics,” explains John Burke, CEO and co-founder at UBIX Labs. “Sometimes you’re throwing data in there where you’ve lost sight of lineage and understanding. So, you’re creating more of a problem than a valuable store.”

Two: data scientists have been difficult to get hold of and, even then, to keep.

There is a well-known shortage of data scientists, and it shows. From 2013 to 2019, data science job postings increased by 256 percent. Deloitte research reports that data scientists “know they are in demand” and spend an average of one to two hours a week looking for new jobs, welcoming offers from potential employers. One study suggested that, between 2017 and 2022, data professionals remained with their employers for an average of only 1.7 years. When they take vital process and progress knowledge to pastures new, a data swamp is left behind.

“Part of the challenge we hear is they’ll run through solving problem A, then that person either doesn’t work there anymore or the data sources have changed or circumstances within the company have changed. So all of that work effort to some degree is lost,” says Miles Mahoney, president and co-founder at UBIX Labs.

The new tools to clean up our act

This year, data handling has gained the potential to transform drastically, with the advent of tools that enable easier, algorithmic search and processing of specific data, and that begin to democratize data handling across the enterprise.

For Sridhar Ramaswamy, co-founder of intelligent search engine Neeva, and now Snowflake SVP, finding a better way to search through data was the way forward.

“The bar we set ourselves was that we would have our AI referenceable, meaning it would do things like cited sources, and the information from it would also be real-time,” says Ramaswamy.

In May this year, Snowflake acquired Neeva, and the company has since launched a whole new set of solutions for better data handling within its cloud data lake, removing the need to transfer data outside the platform and thereby minimizing factors such as egress charges and governance issues.

Its Document AI tool, for instance, allows users to interpret and extract semi-structured information from unstructured PDFs, Word documents, text files and screenshots, giving users more digestible and useful information faster and at scale. Meanwhile, LLM-powered Universal Search can query and locate objects within a database, such as tables, views, schemas and marketplace apps.

“Now we have an ability to train through LLM for these complex unstructured data and ask through conversational AI interfaces these simple English questions to those documents and say, ‘please tell me XYZ about this scenario’. You can translate and decode and in seconds, it gives you answers that have previously been hard for you to get,” says Snowflake’s Bedi. “A big portion of AI use cases is actually how you leverage unstructured data and LLMs to de-code the belt of information sitting in these unstructured PDFs and other documents.”

Taking things further still, firms are also creating ways to democratize and simplify data handling with no-code and low-code tools that make functions accessible and reusable. This not only frees up data scientists’ time through automation, but also enables users without a coding background to access and analyze data.

Co-pilots are set to enable users to generate and refine SQL queries with text-to-code functionality. Other providers, such as UBIX, are meanwhile offering no-code and low-code libraries for analytics workflows, creating repeatable and highly accessible data analytics processes.

For Burke, it’s this kind of technology that is defeating many of the blockers to better data management: “It’s freeing a business up from the reliance on a very scarce resource and allowing them to be empowered to solve problems. Because most data scientist projects are a black box, even if you had questions, the business side doesn’t usually get answers. Now, you could bring something to life really easily without having to be a data scientist.

“Innovating faster, it also actually helps bridge that gap between technical teams in that situation, which the technical people love because it allows them to showcase how the data that they built can turn to business value.”

The use of Snowflake has prompted businesses such as Bentley Motors Limited to upskill employees in data across the company, giving them an easier way to learn the ropes of data handling and analytics under CDO Dr Andy Moore’s strategy of “data is a team sport”. Named the data science dojo, its data literacy program awards students graded belts.

Bedi explains: “You can use democratization – I like the word inclusiveness. I want all our 6500 employees to have access and include them in this transformation journey. You want to give them all the necessary tools, scale and systems that they need. And for them to have the right imagination on how they achieve an A plus. Humans want to do the best and if they have the right measurement in place everything can work.”


Does this mean data scientists will die out?

With an increasing number of employees using data science tools, are we looking at a future with no need for specialist data scientist roles, or at least with fewer of them?

“I mean, I don’t think so. It will definitely make them more productive,” says Ramaswamy. “What we at Snowflake tried to do is make more sophisticated algorithms available to people with less technical knowledge so that they can use them without necessarily getting into trouble.”

So, rather than making data scientists an extinct breed of techie, these kinds of tools look set to enhance their work lives, removing some of the laborious elements of the role and enticing them to stick around for a more creative and riveting to-do list. Failing that, the tools at least enable continuity of their work if they still decide to move on.

For Bedi, it’s a case of providing every worker with an assisted experience: “Where some code suggestion and co-pilot assistant will help to write a portion of your code, I would argue that solves 20-30 percent of it. But, where I think we see a more highly impactful, useful case for the IT organization is – we don’t want to do QA, we don’t want to do testing. We want an AI assistant to help us with that. We want to do more design and development. And so, if we can eliminate a lot of that work, I think there’s a tremendous amount of developer productivity that kicks in.

“That shared use case of an AI assistant, imagine you have 6000 employees and 6000 assistants working for them, just think about the productivity that they could drive. You can’t have an AI strategy without having a data strategy. My recommendation would be don’t make this a very complex long-term strategy. Break it up into chunks and make it actionable and start one or two use cases that you can create that value.”

This doesn’t mean less data will be stored

While the advent of these tools is set to make our data-dealing lives much simpler, this also means the amount of data we store isn’t going to get any smaller. The likes of LLMs are ballooning the amount of data stored, on top of our seemingly insatiable appetite for data hoarding.

“There’s one thing about ingesting and storing that data in 12 terabytes, but there’s a whole other thing around analyzing it because that 12 terabytes turns into 50, as you have all the processing going on. It’s not just the storage layer, it’s running algorithms against it and reducing that to feature sets and other things like that,” explains Burke. “A big point is that if you look at the growth of advanced analytics, it’s going from a $41bn market to $181bn market within three years. With that growth, I believe the majority of companies are going to make a purchase decision in the next three years.”
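As a rough check on the figures Burke quotes, the arithmetic can be sketched in a few lines of Python. The function names here are illustrative assumptions, not anything from UBIX or Snowflake, and the calculation assumes steady compound growth over exactly three years, which the quote does not state.

```python
def compound_annual_growth_rate(start_value: float, end_value: float, years: float) -> float:
    """Return the constant yearly growth rate that turns start_value into end_value."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("start_value, end_value and years must all be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def expansion_factor(stored: float, processed: float) -> float:
    """Return how many times larger the processed footprint is than the stored data."""
    if stored <= 0:
        raise ValueError("stored must be positive")
    return processed / stored


# Figures taken from the quote: a $41bn market growing to $181bn in three years
# (assumed here to be smooth compound growth), and 12 TB of stored data
# turning into 50 TB once processing is under way.
market_cagr = compound_annual_growth_rate(41.0, 181.0, 3.0)
footprint_multiplier = expansion_factor(12.0, 50.0)

print(f"Implied annual market growth: {market_cagr:.1%}")
print(f"Processing footprint multiplier: {footprint_multiplier:.2f}x")
```

Under those assumptions, the market would need to grow by roughly 64 percent a year, and the processing footprint comes to a little over four times the stored volume.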

Even as we master better management of the data itself, we must also ensure that building and running these centers isn’t exacerbating an environmental disaster. The increase in data centers has come with reports of large amounts of water usage to cool them in areas of drought, as well as ongoing energy consumption to keep them running.

The good news is that vendors are responding to some degree, with Microsoft and AWS planning for their data centers to be powered by 100 percent renewable sources of energy by 2025. Meanwhile, Google aims to champion responsible water use and operate entirely on 24/7 carbon-free energy by 2030.

There is certainly more to be done, but from an unruly and expanding data swamp perspective, it’s now looking more likely that businesses can ensure their data is managed and monetized effectively and, looking ahead, hopefully sustainably, with a lot less mess.