Analytical function to the rescuePublished on: Author: Team Communication Category: Oracle
Developers are in huge demand at the moment, and projections are that this will continue as the IoT and technology/digitisation are set to only increase in importance moving forwards. As Patrick Barel, Senior Oracle Developer and Oracle ACE Director, tackles a coding issue, his blog shows why being a developer is such a special profession – and that not everyone is cut out for the job as he is!
The other day, I tried to export 30 million rows from a table so that I could import it into a cloud database. The problem is, I don’t have DBA rights so I had to resort to the tools at my disposal. First, I tried to export the table using the exp.exe utility from Oracle. That went really smoothly, but when I tried to import the table using imp.exe, I was hit by all kinds of errors, most of them relating to missing privileges.
Intro the cloud database
So what next? I chose to export the data using the PL/SQL Developer export. Again, the export went smoothly. It took a lot longer to export this time, but that wasn’t my biggest problem. The real problem arose when I tried to import the table into my cloud database. This was going to take a really long time, so I fired up a separate machine that I could keep running overnight and that I wouldn’t need for my day job.
After a couple of hours of importing, I got the message: not connected to Oracle. Only a small part of the table had been imported. No problem, I thought to myself. Just start it up again, go to bed, and see in the morning. But again, the import was aborted with only about 10% done. Hmmm. I needed a new approach, so I split the export up into 1 million row batches and decided to import those. I was pretty sure that this would let the session finish without errors, such as disconnecting. But how was I going to split the table into parts of about 1 million rows? There is the option of providing a ‘where clause’, but what should I put in there?
How about a SQL solution?
I figured I should use the primary key column as a start and end point. But this column is not densely filled, of course, so I couldn’t just use the min value and add batches of 1 million to it. My first thought was to create an anonymous block, where I could find the column’s min value, and to start some sort of loop to count the rows with an id higher than the last one. Next I could find the id of the record at position 1, and from there find the next one etc.
I started the code, but got hopelessly stuck. I couldn’t get my head around the problem. That was the point at which I started thinking: ‘There has to be an easier way to do this. How about a SQL solution?’
Primary key for every x millionth row
What I was actually looking for was every x millionth row, and then the primary key for that row. I rephrased that to:
'select the primary key where the row number is a multiple of 1000000'
Using the analytical function ROW_NUMBER() would give me the row number for every row that was selected so I needed something like this:
SELECT vbt.id , ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn FROM verybigtable vbt /
The 'edge' case
I was only interested in the ‘edge’ case, meaning those in which rn was a multiple of 1000000. Since you cannot put the analytical function in the ‘where clause’, I would have to use this query as a subquery. I put it in a ‘with clause’ for readability:
Wrapping this query into a with clause (subquery factoring) gave me:
WITH id_rn AS ( SELECT vbt.id , ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn FROM verybigtable vbt ) SELECT id, rn FROM id_rn WHERE 1=1 AND MOD( id_rn.rn, 1000000 ) = 0 /
Constructing a 'where clause'
This query gave me all the ‘edges’ of the sets I wanted to create. So I thought, while I am at it, why not use this query to generate the ‘where clauses’ I need? Again, I wrapped the last query into another ‘with clause’ and used the LAG() function to give me the value of the previous row so I could construct a ‘where clause’:
WITH id_rn AS ( SELECT vbt.id , ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn FROM verybigtable vbt ) , edges AS ( SELECT id, rn FROM id_rn WHERE 1=1 AND MOD( id_rn.rn, 1000000 ) = 0 ) SELECT 'id > ' || to_char( COALESCE ( lag( id ) OVER ( ORDER BY rn ), 0 ) ) || ' and id <= ' || to_char( id ) FROM edges /
Using the LEAD() function
Now, all I had to do was to manually create a ‘where clause’ for the final batch, but that is not too much work. But, while I’m at it, I might as well construct this ‘where clause’ as well I thought, this time using the LEAD() function to get the value of the next row:
WITH id_rn AS ( SELECT vbt.id , ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn FROM verybigtable vbt ) , edges AS ( SELECT id, rn FROM id_rn WHERE 1=1 AND MOD( id_rn.rn, 1000000 ) = 0 ) SELECT 'id > ' || to_char( COALESCE ( lag( id ) OVER ( ORDER BY rn ), 0 ) ) || ' and id <= ' || to_char( id ) , 'id > ' || to_char( id ) || CASE WHEN lead( id ) OVER ( ORDER BY rn ) IS NULL THEN '' ELSE ' and id <= ' || to_char( COALESCE ( lead( id ) OVER ( ORDER BY rn ), 0 ) ) END FROM edges /
So that is where it stands. I could probably optimize this query, like performing the lead operation only once instead of twice as I am doing now I guess. But since this is a once-only query (or sometimes only 😊) I don’t think it’s worth the work.
So. Does this make sense? Not all of it? Some of it? This is where Qualogy helps. We understand it all. And love our job. So give us a call if we you need some help with developer challenges. We’re ready!