When you are tasked with building bots which will process really large spreadsheets, what would you do?
Would you use the Excel Action Package and wait for the bot to loop through each record?
Or would you dump that task onto the junior developers and guilt trip them for not taking this as a “learning experience”, in case they rebel?
This has happened to me on several occasions and yes, it was quite the learning experience.
No but seriously, how would you go about developing this bot to process copious quantities of data?
You interact with Excel like it were a Database!
If that didn’t ring any bells, then you need to go through the introduction before reading any of this.
I’ve used them in few projects, and would have been lost without them. Not only was I able to complete my project ahead of time, my manager even scolded me for it.
Apparently my work was not limited to whatever was specified in the Statement of Work, but also whatever the client decides to squeeze in after the SOW sign off.
I have covered basic operations like connecting to Excel and CSV, filtering data and referencing Columns by their names and/or positions in that article.
Today we will learn advanced operations like manipulating data from various sheets, or CSV files and combining them into a single consolidated unit ready for export.
So without further ado, lets join hands and welcome JOINS!
Get In Touch With Your Inner Join
Being selective about the columns you want from a given table is all fine and dandy, but how about being selective about columns from tables spread across different sheets and files?
How will you make the connection between various sheets or files?
Is there something we can use to “join” them together?
Why yes, yes there is.
Do keep in mind that the Database Operations performed on Excel use Access Engine Syntax, so do keep in mind that you ought to reference the right functions or else it won’t work.
SELECT LOWER([Title]) FROM [Movies$$] //Won’t work SELECT LCASE([Title]) FROM [Movies$$] //Will work
Similarly, your Joins have to follow a slightly different syntax for it to work as expected.
In the excel file located at ExcelDatabaseOperations/Excel Files given above, you will find two sheets Movies and Budget.
We are interested in retrieving the Lead Actor, Director from the Movies Sheet, and Gross Revenue from the Budget Sheet.
Let’s see how that’s done.
SELECT [Lead Actor], [Director], [Gross Revenue]
What comes next?
Where are we going to pull that data FROM?
SELECT [Lead Actor], [Director], [Gross Revenue] FROM [Movies$$] AND [Budget$$]
If this is what you were thinking, then you are mistaken.
Not to worry, mistakes are natural when learning something new. This is what you should do instead.
SELECT t1.[Lead Actor], t1.[Director], t2.[Gross Revenue] FROM [Movies$$] AS t1
The aliases are important since we will be referencing two tables, and the complier needs some sort of a reference to perform the desired operation…but where is our Budget?
That is achieved through JOIN, and more specifically an INNER JOIN.
SELECT t1.[Lead Actor], t1.[Director], t2.[Gross Revenue] FROM [Movies$$] INNER JOIN [Budget$$] AS t2
Joins are incomplete without a condition. They just are, don’t ask me why!
SELECT t1.[Lead Actor], t1.[Director], t2.[Gross Revenue] FROM [Movies$$] INNER JOIN [Budget$$] AS t2 ON t1.[Title] = t2.[Title]
And that is how you perform an INNER JOIN.
But do keep in mind, the INNER JOIN retrieves only those values which are present in both Tables, i.e., it retrieves an intersection of data which meet a given condition.
What if you wanted to keep values from one or both tables?
You’re Making Me Feel Left Out(er)
You can either specify the JOIN as LEFT/RIGHT OUTER JOIN.
LEFT/RIGHT captures all elements from either table, based on how you structure the query.
SELECT t1.[Lead Actor], t1.[Director], t2.[Gross Revenue] FROM [Movies$$] LEFT OUTER JOIN [Budget$$] AS t2 --Captures all records from 'Movies' ON t1.[Title] = t2.[Title]
SELECT t1.[Lead Actor], t1.[Director], t2.[Gross Revenue] FROM [Movies$$] RIGHT OUTER JOIN [Budget$$] AS t2 --Captures all records from 'Budget' ON t1.[Title] = t2.[Title]
SELECT t1.[Lead Actor], t1.[Director], t2.[Gross Revenue] FROM [Movies$$] RIGHT JOIN [Budget$$] AS t2 --It isn't necessary to specify the keyword 'OUTER' ON t1.[Title] = t2.[Title] WHERE INSTR(t1.[Lead Actor],'Tom') = 1 --You can chain additional conditions either using ON or WHERE ORDER BY t1.[Title] --And order by whichever column you want
I tried to make some FULL OUTER JOINS, but couldn’t.
I spend way too much time experimenting with this and the errors I encountered drove me absolutely nuts. I thought it had something to do with the syntax, but that wasn’t it.
Microsoft Access doesn’t support FULL OUTER JOINs, but that doesn’t mean you can’t achieve the result.
It gets a little tricky, and you can read more here.
What If I Have Multiple Excel Files?
The connection string only has room for a single file path, and I even tried modifying the CSV connection string to make it work with multiple excel files.
You have to merge them into a single spreadsheet using the Excel Advanced: Merge Action before performing JOINS.
Can We Create Sheets with Database Operations?(Spoiler Alert: Yes)
Yep, and I discovered this pearl while surfing across the Apeople forum.
You often come across some pretty interesting problems there, which is why I encourage you to visit it regularly and like all my comments.
The syntax for this is pretty straightforward, and there is only one prerequisite.
Database operations work by establishing a connection with the data source before doing its magic. This means, you can create Sheets, but you can’t create the Spreadsheet itself.
The spreadsheet has to exist before we can add sheets into it.
To add sheets, you have to use a Database: Insert/Update/Delete Action.
CREATE TABLE [Let There Be Sheet] (FirstName VARCHAR, LastName VARCHAR, Department VARCHAR, Location VARCHAR); --Its better leaving the type as VARCHAR, i.e., make it generic
If you want to create a new spreadsheet and perform operations onto it, I won’t recommend using the File: Create Action to create the Excel Spreadsheet, instead use the Excel Advanced: Create Workbook.
The File: Create Action has a bug which will be addressed and resolved in next update.
Also, when you create a new Spreadsheet, it will always has a sheet named Sheet1, which is why I think it only makes sense to use this with pre-existing spreadsheets.
If you are still curious, you can simply try creating an Excel Spreadsheet using the File Action Package and try opening it. Its better if you get in the habit of playing around with different actions just to get a feel for it, as it helps you make better decisions during development and even prevent major production issues.
Update Sheets with…JOINS?
Its better if you find other ways of achieving this because there is a good chance that this operation will timeout.
UPDATE [Movies$$] AS t1 INNER JOIN [Budget$$] AS t2 ON t1.[Title]=t2.[Title] SET t1.[Title]= t2.[Profit] --The order is important
Of course, you can lengthen the SQL Timeout, but I would recommend using the Python Scripting Package instead, particularly the Pandas library.
Joining Multiple CSV Files
CSV are static files which means you can only query data from it and create joins.
Whether you have multiple sheets in a spreadsheet or multiple CSV files in a given Directory, you can chain as many JOINs as you want.
That will be illustrated here, as we will try to join three CSV files.
Now remember, the connection string for CSV files has to reference its Folder Path, and not the individual files. The table names are not the sheet names, but the file name along with its extension.
SELECT t2.[Lead Actor], t1.[Genre], t3.[Profit] FROM [TGL.csv] AS t1 INNER JOIN [TAD.csv] AS t2 ON t1.[Title] = t2.[Title]
When we chain JOINs, the previous INNER JOIN has to be converted into a sub-query i.e., just sandwich the meat between two moon brackets before adding the last slice of INNER JOIN.
SELECT t2.[Lead Actor], t1.[Genre], t3.[Profit] FROM ([TGL.csv] AS t1 INNER JOIN [TAD.csv] AS t2 ON t1.[Title] = t2.[Title]) INNER JOIN [DGB.csv] AS t3 ON t2.[Lead Actor] = t3.[Lead Actor]
If there are more files, then simply convert the entire block into a sub-query like so:
SELECT t2.[Lead Actor], t1.[Genre], t3.[Profit], tct.[Adsense Revenue] FROM (([TGL.csv] AS t1 INNER JOIN [TAD.csv] AS t2 ON t1.[Title] = t2.[Title]) INNER JOIN [DGB.csv] AS t3 ON t2.[Lead Actor] = t3.[Lead Actor]) INNER JOIN [TCT.csv] As tct ON t3.[Profit] = tct.[Profit]
And with that, we have covered most of the Database operations you will ever require when interacting with spreadsheets and CSV files.
I still haven’t found anything about CSV files, and whether we can actually manipulate data within it, but it doesn’t matter. We can always use a WHERE to filter data from it, write it to Excel, then update it.
All in all, there is nothing complicated about Database Operations. Exposure is the key, and the more familiar you become with it, the better equipped you are.
Now if you will excuse me, I have to get back to guilt tripping my juniors for not recognizing how generous I am for sharing all my work experience with them.