Almost every business has data stored in Excel spreadsheets or CSV files. That's not only because data originates in one of these formats, but also because data gets exported or extracted from other systems and applications into them. We've used these formats for a long time now, and while the volume of data we work with has increased significantly, the tools remain mostly the same. Excel launched about 35 years ago (1985 on the Mac, 1987 on Windows), and CSV as a data format predates the personal computer, with the IBM Fortran compiler supporting it in 1972.
“Input entries are separated by blanks or commas, with successive commas indicating values to be omitted.” - IBM FORTRAN Program Products for OS and the CMS Component of VM/370 General Information
Spreadsheets and CSVs usually live in file systems: on your computer, on SFTP and email servers, and, increasingly often, backed up and synced to the cloud. They are stored as files, and while these files can be moved online, the underlying data is pretty much still “trapped” inside them, even when they sit in cloud drives. That's because it's still hard to analyze the data, centralize it for reporting, or automate business operations without further processing or technical help. To make this data useful, we still need to clean it, format it, structure it, and validate it.
On the one hand, spreadsheets like Excel, Google Sheets, or Airtable are simple to use, but they aren't built for large volumes of data or for scale. On the other, tools that can scale and are built for volume are too complicated for most people to set up and use, and that's especially true for the average spreadsheet user. There seems to be a gap between these two categories of tools.
Yet the problem isn't necessarily in the tools themselves. We have spreadsheets that are powerful and easy to use and understand. We have databases and data warehouses that can handle scale and large amounts of data, and can even query millions of rows in a second or less.
One reason could be that while we have tools for both extremes of the use-case spectrum, each side keeps improving in the areas it is already good at. Spreadsheet developers continue making their spreadsheets better, adding more ways to filter, view, group, and pivot data. Database developers continue making their databases better, enhancing speed, reliability, performance, and security.
A solution could be something that brings the power of databases to spreadsheet users. Or, alternatively, something that elevates spreadsheets into tools with the desirable properties of databases for dealing with larger volumes of data. Today, for the most part, spreadsheets are for business users and databases are for developers. But there is an emerging category of user - the power business user. These are users who need to work with more data, who have a basic understanding of databases and even SQL, and who want database features, but who struggle to set a database up and import data into it without support from a technical team. These power business users want to work with databases but can't easily do so, because databases aren't built with them in mind.
One way to go about it could be to combine three concepts that almost everyone is familiar with: spreadsheets, file managers, and storage drives.
The idea would be something like this: you have a file manager, but instead of managing files, it lets you manage the underlying data in your files. This data is stored on a hard drive and organized into tables. You can access your data through a spreadsheet interface that allows you to view, edit, filter, and clean it. You can drag and drop data into this file system and then use it the way you'd use a database. Embedded in this system, there's an actual database and a corresponding database engine, which could be optimized for more analytical workloads (a column-oriented database). When you open a table through this file system, it's actually querying the embedded database. This process is, of course, abstracted away from the user. To the user, opening a table would be like opening an Excel file, except they're really opening a database table.
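To make this idea concrete, here's a minimal sketch in Python, using DuckDB as a stand-in for the embedded column-oriented engine (the file, database, and table names are all hypothetical):

```python
import duckdb

# The "drive": a single database file that holds every table in the system.
con = duckdb.connect("data_drive.db")

# Dragging a CSV into the file manager becomes, behind the scenes,
# creating a table from its contents.
con.execute(
    "CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_export.csv')"
)

# "Opening a table" in the UI is really just running a query.
first_page = con.execute("SELECT * FROM sales LIMIT 100").fetchall()
```

DuckDB happens to be exactly the kind of embedded, column-oriented, analytics-oriented engine described above, which makes it a natural fit for a sketch like this.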
One of the surprising challenges of using a database, aside from setting it up, is actually creating tables and importing data into them. If the data from the spreadsheet file is perfectly clean, properly formatted, and validated, it's relatively easy. But most practical data is imperfect, messy, and poorly formatted. So while in theory importing an Excel file into a database should be easy, in practice it ranges from super tedious to impossible, depending on the user, and that includes even more technical users. Why is this?
Take a look at the following Excel spreadsheet. Do you notice anything that’s out of place?
The gotcha answer is no, and that's because a lot of Excel spreadsheets actually do look like this. Just ask any of your friends in non-technical roles. If anything, this specific Excel file is on the less crazy side of the spectrum. (Feel free to send me a screenshot of the craziest Excel file you've seen.)
Importing this file into your database is not trivial, even if it should be, or you expect it to be.
First, note that the column headers don't start in the first row. To make things more complicated, the data of interest is not directly adjacent to the column headers.
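As a rough illustration, header detection might look something like this sketch in Python with pandas; the file name and the heuristic are made up, and real files will defeat any single rule:

```python
import pandas as pd

# Read the sheet with no header assumption, so every row is just data.
raw = pd.read_excel("messy_report.xlsx", header=None)

# Hypothetical heuristic: the header row is the first row where at least
# three cells are non-empty strings.
def looks_like_header(row):
    values = row.dropna()
    return len(values) >= 3 and all(isinstance(v, str) for v in values)

header_idx = next(i for i, row in raw.iterrows() if looks_like_header(row))

# Re-read using the detected header row, then drop the blank rows
# sitting between the header and the data of interest.
df = pd.read_excel("messy_report.xlsx", header=header_idx)
df = df.dropna(how="all")
```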
Next, assuming you want to preserve data types rather than import everything as strings, this file has a few additional issues that prevent you from easily importing it into a database.
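Continuing the sketch above, coercing the types before the data can land in a database might look like this (the column names are hypothetical, and a real file would need more rules than these two):

```python
# Currency columns often arrive as text like "$1,234.50".
df["amount"] = (
    df["amount"]
    .astype(str)
    .str.replace(r"[$,]", "", regex=True)
    .pipe(pd.to_numeric, errors="coerce")  # unparseable cells become NaN
)

# Dates may be stored as strings, sometimes in inconsistent formats.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
```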
Other challenges with importing data come from the fact that spreadsheets let users create formulas (pretty much code), pivot tables, and charts. In theory, all of these can be recreated with apps or tools that sit on top of a database, but these concepts don't translate one-to-one to a SQL database without additional work. You can't just import a spreadsheet along with its formulas and automatically map that data to database tables and SQL statements. It might be technically possible to a certain extent, but it wouldn't be easy to do.
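As a small example of what that additional work looks like: a spreadsheet formula like =SUMIF(A:A, "EMEA", B:B) refers to cell ranges, while its database equivalent has to refer to named columns in a named table. Continuing the earlier DuckDB sketch (the column names are again hypothetical):

```python
# The spreadsheet formula =SUMIF(A:A, "EMEA", B:B) becomes a query
# against named columns rather than cell ranges.
emea_total = con.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'EMEA'"
).fetchone()[0]
```

The formula travels with the cells; the query needs a schema. That mapping is exactly the work that doesn't happen automatically.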
Finally, spreadsheets are meant to be “reactive”. By this I mean that changing a constant in one cell should update any calculation that depends on that constant in a different cell. This applies not only to constants, cells, and calculations located in the same sheet, but also across sheets. For some use cases in certain industries, there could be spreadsheets with tens of sheets/tabs, thousands of rows, and millions of cell calculations altogether - times a thousand, because that's how many spreadsheets power the entire business. (This is not made up. I've seen it first hand when consulting for a huge pension fund.) All of this means that a solution would probably require a new kind of database, one that looks like a hybrid between a transactional database, an analytical database, and maybe even a graph database - essentially a database that can operate quickly on large amounts of data while still allowing for high-performance reactivity, even with lots of calculations involved.
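To see why the reactivity part is hard, here's a toy sketch of reactive recalculation in Python. Real spreadsheet engines are far more sophisticated; this is just the concept, with naive propagation:

```python
# A toy reactive cell model: each cell is a constant or a function of
# other cells, and changing one cell recomputes its dependents.
class Sheet:
    def __init__(self):
        self.values = {}      # cell name -> current value
        self.formulas = {}    # cell name -> (function, dependency names)
        self.dependents = {}  # cell name -> set of cells that depend on it

    def set_value(self, name, value):
        self.values[name] = value
        self._recalculate(name)

    def set_formula(self, name, deps, func):
        self.formulas[name] = (func, deps)
        for dep in deps:
            self.dependents.setdefault(dep, set()).add(name)
        self._compute(name)

    def _compute(self, name):
        func, deps = self.formulas[name]
        self.values[name] = func(*(self.values[dep] for dep in deps))

    def _recalculate(self, changed):
        # Naive propagation. A real engine needs topological ordering,
        # cycle detection, and incremental evaluation to stay fast at
        # the scale of millions of cell calculations.
        for dependent in self.dependents.get(changed, set()):
            self._compute(dependent)
            self._recalculate(dependent)

sheet = Sheet()
sheet.set_value("A1", 100)
sheet.set_formula("B1", ["A1"], lambda a: a * 1.21)  # e.g. price plus tax
sheet.set_value("A1", 200)  # B1 updates automatically
print(sheet.values["B1"])   # 242.0
```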
There definitely seems to be a gap between increasingly large spreadsheets and (analytical) databases, but I wonder: is this a gap that could actually be closed, and if so, would it make sense to close it?
Or should we leave spreadsheets as the ubiquitous, powerful, beautiful, but messy monster they are?