Analytical function to the rescue

Published: Category: Oracle

This blog is only available in English.

Developers are in huge demand at the moment, and projections suggest this will continue as the IoT and digitisation only grow in importance. As Patrick Barel, Senior Oracle Developer and Oracle ACE Director, tackles a coding issue, his blog shows why being a developer is such a special profession – and that not everyone is cut out for the job like he is!

The other day, I tried to export 30 million rows from a table so that I could import them into a cloud database. The problem is, I don’t have DBA rights, so I had to resort to the tools at my disposal. First, I tried to export the table using the exp.exe utility from Oracle. That went really smoothly, but when I tried to import the table using imp.exe, I was hit by all kinds of errors, most of them relating to missing privileges.

Into the cloud database

So what next? I chose to export the data using the PL/SQL Developer export. Again, the export went smoothly. It took a lot longer to export this time, but that wasn’t my biggest problem. The real problem arose when I tried to import the table into my cloud database. This was going to take a really long time, so I fired up a separate machine that I could keep running overnight and that I wouldn’t need for my day job.

After a couple of hours of importing, I got the message: not connected to Oracle. Only a small part of the table had been imported. No problem, I thought to myself. Just start it up again, go to bed, and check in the morning. But again, the import was aborted with only about 10% done. Hmmm. I needed a new approach, so I decided to split the export up into batches of 1 million rows and import those. I was pretty sure this would let each session finish without errors such as the disconnects. But how was I going to split the table into parts of about 1 million rows? There is the option of providing a ‘where clause’, but what should I put in there?

How about a SQL solution?

I figured I should use the primary key column as a start and end point. But this column is not densely filled, of course, so I couldn’t just take the min value and add batches of 1 million to it. My first thought was to create an anonymous block, where I could find the column’s min value, and then start some sort of loop to count the rows with an id higher than the last one. Next I could find the id of the record at position 1, and from there find the next one, and so on.

I started the code, but got hopelessly stuck. I couldn’t get my head around the problem. That was the point at which I started thinking: ‘There has to be an easier way to do this. How about a SQL solution?’

Primary key for every x millionth row

What I was actually looking for was every x millionth row, and then the primary key for that row. I rephrased that to:

'select the primary key where the row number is a multiple of 1000000'

Using the analytical function ROW_NUMBER() would give me the row number for every row selected, so I needed something like this:

SELECT vbt.id
,      ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn
FROM   verybigtable vbt
/

The 'edge' case

I was only interested in the ‘edge’ rows, meaning those in which rn is a multiple of 1000000. Since you cannot use an analytical function in the ‘where clause’, I would have to use this query as a subquery. Wrapping it in a ‘with clause’ (subquery factoring) for readability gave me:

WITH id_rn AS ( SELECT vbt.id
                ,      ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn
                FROM   verybigtable vbt
              )
SELECT id, rn
FROM   id_rn
WHERE  1=1
AND    MOD( id_rn.rn, 1000000 ) = 0
/
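To make the logic concrete outside the database, here is a minimal plain-Python sketch (not Oracle SQL) of what this query computes: number the sorted ids and keep the id at every batch-size-th position. The sample ids and the small batch size are made up for illustration.

```python
def edge_ids(ids, batch_size):
    """Return the id at every batch_size-th position, mimicking
    MOD(ROW_NUMBER() OVER (ORDER BY id), batch_size) = 0."""
    edges = []
    for rn, id_ in enumerate(sorted(ids), start=1):  # rn plays ROW_NUMBER()
        if rn % batch_size == 0:                     # MOD(rn, batch_size) = 0
            edges.append(id_)
    return edges

# Sparse ids, batches of 3: the edges are the 3rd and 6th smallest ids.
print(edge_ids([5, 9, 12, 20, 31, 44, 57], 3))  # [12, 44]
```

Note that the ids themselves stay sparse; only their positions are dense, which is exactly why ROW_NUMBER() works here where min-plus-a-million would not.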

Constructing a 'where clause'

This query gave me all the ‘edges’ of the sets I wanted to create. So I thought, while I am at it, why not use this query to generate the ‘where clauses’ I need? Again, I wrapped the last query into another ‘with clause’ and used the LAG() function to give me the value of the previous row so I could construct a ‘where clause’:

WITH id_rn AS ( SELECT vbt.id
                ,      ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn
                FROM   verybigtable vbt
              )
,    edges AS ( SELECT id, rn
                FROM   id_rn
                WHERE  1=1
                AND    MOD( id_rn.rn, 1000000 ) = 0
              )
SELECT 'id > ' ||
       to_char( COALESCE ( lag( id ) OVER ( ORDER BY rn ), 0 ) ) ||
       ' and id <= ' ||
       to_char( id )
FROM   edges
/

Using the LEAD() function

Now all that was left was to manually create a ‘where clause’ for the final batch, which is not too much work. But while I was at it, I figured I might as well construct this ‘where clause’ too, this time using the LEAD() function to get the value of the next row:

WITH id_rn AS ( SELECT vbt.id
                ,      ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn
                FROM   verybigtable vbt
              )
,    edges AS ( SELECT id, rn
                FROM   id_rn
                WHERE  1=1
                AND    MOD( id_rn.rn, 1000000 ) = 0
              )
SELECT 'id > ' ||
       to_char( COALESCE ( lag( id ) OVER ( ORDER BY rn ), 0 ) ) ||
       ' and id <= ' ||
       to_char( id )
,      'id > ' ||
       to_char( id ) ||
       CASE WHEN lead( id ) OVER ( ORDER BY rn ) IS NULL
         THEN ''
         ELSE ' and id <= ' ||
              to_char( COALESCE ( lead( id ) OVER ( ORDER BY rn ), 0 ) )
       END
FROM   edges
/
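The same generation logic can be sketched in plain Python (again with made-up ids and a small batch size): the previous edge (LAG, defaulting to 0) and the current edge bound each batch, and when there is no next edge (LEAD is NULL) the clause for the leftover rows stays open-ended.

```python
def batch_where_clauses(ids, batch_size):
    """Mimic the LAG()/LEAD() query: build one 'where clause' per batch,
    plus an open-ended clause for the final partial batch."""
    edges = [id_ for rn, id_ in enumerate(sorted(ids), start=1)
             if rn % batch_size == 0]
    clauses = []
    prev = 0  # COALESCE(LAG(id) OVER (ORDER BY rn), 0)
    for edge in edges:
        clauses.append(f'id > {prev} and id <= {edge}')
        prev = edge
    if len(ids) % batch_size != 0:  # rows beyond the last edge: LEAD is NULL
        clauses.append(f'id > {prev}')
    return clauses

print(batch_where_clauses([5, 9, 12, 20, 31, 44, 57], 3))
# ['id > 0 and id <= 12', 'id > 12 and id <= 44', 'id > 44']
```

Each clause can then be pasted into the export tool's ‘where clause’ field, one batch per run.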

So that is where it stands. I could probably optimize this query, for instance by performing the lead operation only once instead of twice. But since this is a once-only query (or sometimes-only 😊), I don’t think it’s worth the work.
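For what it's worth, one possible shape of that optimization (a sketch, untested against a real Oracle instance) is to compute LAG() and LEAD() once in an extra ‘with clause’ and build both strings from the resulting columns:

```sql
WITH id_rn AS ( SELECT vbt.id
                ,      ROW_NUMBER() OVER ( ORDER BY vbt.id ) rn
                FROM   verybigtable vbt
              )
,    edges AS ( SELECT id, rn
                FROM   id_rn
                WHERE  MOD( id_rn.rn, 1000000 ) = 0
              )
,    nbrs  AS ( SELECT id
                ,      COALESCE( LAG( id ) OVER ( ORDER BY rn ), 0 ) prev_id
                ,      LEAD( id ) OVER ( ORDER BY rn )               next_id
                FROM   edges
              )
SELECT 'id > ' || to_char( prev_id ) || ' and id <= ' || to_char( id )
,      'id > ' || to_char( id ) ||
       CASE WHEN next_id IS NULL
         THEN ''
         ELSE ' and id <= ' || to_char( next_id )
       END
FROM   nbrs
/
```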

More information?

So. Does this make sense? Not all of it? Some of it? This is where Qualogy helps. We understand it all. And love our job. So give us a call if you need some help with developer challenges. We’re ready!

Patrick Barel
About the author Patrick Barel

Patrick studied Telematics and has been a full-time Oracle Developer since 1999. After his studies he followed several PL/SQL courses (including courses by Steven Feuerstein) and completed Oracle's OCA and OCP programmes in SQL and PL/SQL. Alongside his work with SQL and PL/SQL, Patrick has written several plug-ins for PL/SQL Developer (http://www.allroundautomations.com/). He also publishes regularly on his own blog (http://blog.bar-solutions.com/). In 2011 Patrick was awarded the title of Oracle ACE for his contributions to the field of SQL and PL/SQL. In 2015 he received the Developer Choice Award, and in 2019 he was awarded the title of Oracle ACE Director for his contributions to the Oracle community.
