Feed aggregator

Mind Over Matter

Greg Pavlik - Fri, 2019-01-18 18:13
"Plotinus gave exquisitely refined expression to the ancient intuition that the material order is not the basis of the mental, but rather the reverse. This is not only an eminently rational intuition; it is perhaps the only truly rational picture of reality as a whole. Mind does not emerge from mindless matter, as modern philosophical fashion would have it. The suggestion that is does is both a logical impossibility and a phenomenological absurdity. Plotinus and his contemporaries understood that all the things that most essentially characterize the act of rational consciousness—its irreducible unity of apprehension, its teleological structure, the logical syntax of reasoning, and on and on—are intrinsically incompatible with, and could not logically emerge from, a material reality devoid of mind. At the same time, they could not fail to notice that there is a constant correlation between that act of rational consciousness and the intelligibility of being, a correlation that is all but unimaginable if the structure and ground of all reality were not already rational. Happily, in Plotinus’s time no one had yet ventured the essentially magical theory of perception as representation. Plotinus was absolutely correct, therefore, to attempt to understand the structure of the whole of reality by looking inward to the structure of the mind; and he was just as correct to suppose that the reciprocity between the mind and objective reality must indicate a reality simpler and more capacious than either: a primordial intelligence, Nous, and an original unity, the One, generating, sustaining, and encompassing all things. And no thinker of late antiquity pursued these matters with greater persistence, rigor, and originality than he did."

DB Hart commenting on the new translation of Plotinus's Enneads.

Is Star Schema necessary?

Dylan's BI Notes - Fri, 2019-01-18 12:30
A star schema describes the data by fact and dimension. From one angle, it is a data modeling technique for designing the data warehouse based on relational database technology. In the old OLAP world, even though a cube also links to the dimensions that describe the measure, we typically won’t call it a Star Schema. […]
Categories: BI & Warehousing

SQL*Server Import Export Wizard 2017 vs Office 64 bit

Stephen Booth - Fri, 2019-01-18 10:57
Not Oracle, but since we're having to work multi-platform these days this might be relevant to some. When you install SQL*Server Management Studio on 64 bit Windows, the Management Studio (SSMS, the equivalent of SQL*Plus) is 64 bit, but it installs the 32 bit version of the Import Export Wizard (SSIS). This is fine if you have the 32 bit version of Office installed or only import and export CSV or other …

DML Tablescans

Jonathan Lewis - Fri, 2019-01-18 07:25

This note is a follow-up to a recent comment on a blog note about Row Migration:

So I wonder what is the difference between the two, parallel dml and serial dml with parallel scan, which makes them behave differently while working with migrated rows. Why might the strategy of the serial dml with parallel scan case not work in the parallel dml case? I am going to make a service request to get some clarification, but maybe I am missing something obvious?

The comment also referenced a couple of MoS notes:

  • Bug 17264297 “Serial DML with Parallel scan performs single block reads during full table scan when table has chained rows in 11.2”
  • Doc ID 1514011.1 “Performance decrease for parallel DML on compressed tables or regular tables after 11.2 Upgrade”

The latter document included a comment to the effect that 11.2 uses a “Head Piece Scan” while 11.1 uses a “First Piece scan”, which is a rather helpful comment. Conveniently the blog note itself referenced an earlier note on the potential for differentiating between migrated and chained rows through a “flag” byte associated with each row piece. The flag byte has an H bit for the row head piece, an F bit for the row first piece, an L bit for the row last piece, and {no bits set} for a row piece in the middle of a chained row.

Side note: A “typical” simple row will be a single row-piece with the H, F and L bits all set; a simple migrated row will start with an “empty” row-piece in one block with the H bit set and a pointer (nrid – next rowid) to a row in another block that will have the F and L bits set and a pointer (hrid – head rowid) back to the head piece. A chained row could start with a row piece holding a few columns and the H and F bits set, with a pointer to the next row piece, which might lead to a long chain of row pieces with no bits set, each pointing to the next row piece, until you get to a row piece with the L bit set. Alternatively you might have a row which had migrated and chained – which means it could start with an empty row piece with just the H bit and a pointer to the next row piece, then a row piece with the F bit set, a back pointer to the header, and a next pointer to the next row piece, which could lead to a long chain of row pieces with no bits set until you reach a row piece with the L bit set.

Combining the comments about “head piece” and “first piece” scans with the general principles of DML and locking, it’s now possible to start making some guesses about why the Oracle developers might want updates through tablescans to behave differently for serial and parallel tablescans. There are two performance targets to consider:

  • How to minimise random (single block) I/O requests
  • How to minimise the risk of deadlock between PX server processes.

Assume you’re doing a serial tablescan to find rows to update – assume for simplicity that there are no chained rows in the table. When you hit a migrated row (H bit only) you could follow the next rowid pointer (nrid) to find and examine the row. If you find that it’s a row that doesn’t need to be updated you’ve just done a completely redundant single block read; so it makes sense to ignore row pieces which are “H”-only row pieces and do a table scan based on “F” pieces (which will be FL “whole row” pieces thanks to our assumption of no chained rows). If you find a row which is an F row and it needs to be updated then you can do a single block read using the head rowid pointer (hrid) to lock the head row piece then lock the current row piece and update it; you only do the extra single block read for rows that need updates, not for all migrated rows. So this is (I guess) the “First Piece Scan” referenced in Doc ID 1514011.1. (And, conversely, if you scan the table looking only for row pieces with the H flag set this is probably the “Head Piece Scan”).

But there’s a potential problem with this strategy if the update is a parallel update. Imagine parallel server process p000 is scanning the first megabyte of a table and process p001 is scanning the second megabyte using the “first piece” algorithm.  What happens if p001 finds a migrated row (flags = FL) that needs to be updated and follows its head pointer back into a block in the megabyte being scanned by p000?  What if p000 has been busy updating rows in that block and there are no free ITLs for p001 to acquire to lock the head row piece? You have the potential for an indefinite deadlock.

On the other hand, if the scan is using the “head piece” algorithm p000 would have found the migrated row’s head piece and followed the next rowid pointer into a block in the megabyte being scanned by p001. If the row needs to be updated p000 can lock the head piece and the migrated piece.

At this point you might think that the two situations are symmetrical – aren’t you just as likely to get a deadlock because p000 now wants an ITL entry in a block that p001 might have been updating? Statistically the answer is “probably not”. When you do lots of updates it is possible for many rows to migrate OUT of a block; it is much less likely that you will see many rows migrate INTO a specific block. This means that in a parallel environment you’re more likely to see several PX servers all trying to acquire ITL entries in the same originating block than you are to see several PX servers trying to acquire ITL entries in the same destination block. There’s also the feature that when a row (piece) migrates into a block Oracle adds an entry to the ITL list if the number of inwards migrated pieces is more than the current number of ITL entries.

Conclusion

It’s all guesswork of course, but I’d say that for a serial update by tablescan Oracle uses the “first piece scan” to minimise random I/O requests while for a parallel update by tablescan Oracle uses the “head piece scan” to minimise the risk of deadlocks – even though this is likely to increase the number of random (single block) reads.

Finally (to avoid ambiguity) if you’ve done an update which does a parallel tablescan but a serial update (by passing rowids to the query co-ordinator) then I’d hope that Oracle would use the “first piece scan” for the parallel tablescan because there’s no risk of deadlock when the query co-ordinator is the only process doing the locking and updating, which makes it safe to use the minimum I/O strategy. (And a parallel query with serial update happens quite frequently because people forget to enable parallel dml.)

Footnote

While messing around to see what happened with updates and rows that were both migrated and chained I ran the following script to create one nasty row, so that I could dump a few table blocks to check for ITLs, pointers, and locks. The aim was to get a row with a head-only piece (“H” bit), an F-only piece, a piece with no bits set, then an L-only piece. With an 8KB block size and a 4,000 byte maximum for varchar2() this is what I did:


rem
rem     Script:         migrated_lock.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2019
rem     Purpose:
rem
rem     Last tested
rem             18.3.0.0
rem

create table t1 (
        n1 number,
        l1 varchar2(4000),
        s1 varchar2(200),
        l2 varchar2(4000),
        s2 varchar2(200),
        l3 varchar2(4000),
        s3 varchar2(200)
);

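-- Row 0: roughly 4,200 bytes of data, taking up more than half of the 8KB block so that row 1 cannot grow in place later.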
insert into t1 (n1,l1,s1) values(0,rpad('X',4000,'X'),rpad('X',200,'X'));
commit;

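-- Row 1: starts as a tiny row (just the number), expected to land in the same block as row 0.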
insert into t1 (n1,l1) values(1,null);
commit;

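-- Grow row 1 to roughly 12,600 bytes: far too large for any single 8KB block, so it has to migrate out of its (nearly full) original block and then chain across several blocks.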
update t1 set
        l1 = rpad('A',4000),
        s1 = rpad('A',200),
        l2 = rpad('B',4000),
        s2 = rpad('B',200),
        l3 = rpad('C',4000),
        s3 = rpad('C',200)
where
        n1 = 1
;

commit;

execute dbms_stats.gather_table_stats(user,'t1');

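-- Update only the three short columns: as the block dumps below show, this locks the head piece and the two row pieces that actually change, but not the piece holding only unchanged columns.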
update t1 set
        s1 = lower(s1),
        s2 = lower(s2),
        s3 = lower(s3)
where
        n1 = 1
;

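-- Write the modified blocks to disk before dumping them.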
alter system flush buffer_cache;

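-- Report the file and block number holding the start (head piece) of each row.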
select
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        count(*)                                rows_starting_in_block
from
        t1
group by
        dbms_rowid.rowid_relative_fno(rowid),
        dbms_rowid.rowid_block_number(rowid)
order by
        dbms_rowid.rowid_relative_fno(rowid),
        dbms_rowid.rowid_block_number(rowid)
;

The query with all the calls to dbms_rowid gave me the file and block number of the row I was interested in, so I dumped the block, then read the trace file to find the next block in the chain, and so on. The first block held just the head piece, the second block held the n1 and l1 columns (which didn’t get modified by the update), the third block held the s1 and l2 columns, the last block held the s2, l3 and s3 columns. I had been expecting to see the split as (head-piece), (n1, l1, s1), (l2, s2), (l3, s3) – but as it turned out the unexpected split was a bonus.

Here are extracts from each of the blocks (in the order they appeared in the chain), showing the ITL information and the “row overhead” information. If you scan through the list you’ll see that three of the 4 blocks have an ITL entry for transaction id (xid) of 8.1e.df3, using three consecutive undo records in undo block 0x0100043d. My update has locked 3 of the 4 rowpieces – the header and the two that have changed. It didn’t need to “lock” the piece that didn’t change. (This little detail was the bonus of the unexpected split.)


Block 184
---------
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.36  C---    0  scn  0x00000000005beb39
0x02   0x0008.01e.00000df3  0x0100043d.0356.2e  ----    1  fsc 0x0000.00000000

...

tab 0, row 1, @0xf18
tl: 9 fb: --H----- lb: 0x2  cc: 0
nrid:  0x00800089.0



Block 137       (columns n1, l1 - DID NOT CHANGE so no ITL entry acquired)
---------       (the lock byte relates to the previous, not cleaned, update) 
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.35  --U-    1  fsc 0x0000.005beb39
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn  0x0000000000000000

...

tab 0, row 0, @0xfcb
tl: 4021 fb: ----F--- lb: 0x1  cc: 2
hrid: 0x008000b8.1
nrid:  0x00800085.0



Block 133 (columns s1, l2)
--------------------------
Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.34  C---    0  scn  0x00000000005beb39
0x02   0x0008.01e.00000df3  0x0100043d.0356.2f  ----    1  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn  0x0000000000000000

...

tab 0, row 0, @0xf0b
tl: 4213 fb: -------- lb: 0x2  cc: 2
nrid:  0x008000bc.0



Block 188 (columns s2, l3, s3)
------------------------------
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.00b.00000ee1  0x01000bc0.036a.33  C---    0  scn  0x00000000005beb39
0x02   0x0008.01e.00000df3  0x0100043d.0356.30  ----    1  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  C---    0  scn  0x0000000000000000

...

tab 0, row 0, @0xe48
tl: 4408 fb: -----L-- lb: 0x2  cc: 3

Note, by the way, how there are nrid (next rowid) entries pointing forward in every row piece (except the last), but it’s only the “F” (First) row-piece that has the hrid (head rowid) pointer pointing backwards.
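As an aside, if you would rather chase such a chain programmatically than by eye, a small helper along the following lines can pull the flag byte (fb), lock byte and the nrid/hrid pointers out of a block dump trace file. It is only a rough sketch that assumes the dump lines look like the extracts above; the trace file name passed on the command line is whatever your session produced, and the patterns may need adjusting for your version's dump format.

#! /usr/bin/python3
import re
import sys

# For each row piece in an Oracle block dump trace file, report the flag byte (fb),
# lock byte (lb), column count (cc) and any nrid/hrid pointers, based on lines
# formatted like the extracts shown above.
row_re = re.compile(r"^tl:\s*(\d+)\s+fb:\s*(\S+)\s+lb:\s*(\S+)\s+cc:\s*(\d+)")
ptr_re = re.compile(r"^(nrid|hrid):\s*(\S+)")

def scan_dump(filename):
   piece = None
   with open(filename) as f:
      for line in f:
         line = line.strip()
         m = row_re.match(line)
         if m:
            if piece:
               print(piece)
            tl, fb, lb, cc = m.groups()
            piece = {'fb': fb, 'lb': lb, 'cc': int(cc),
                     'H': 'H' in fb, 'F': 'F' in fb, 'L': 'L' in fb}
            continue
         m = ptr_re.match(line)
         if m and piece is not None:
            piece[m.group(1)] = m.group(2)
   if piece:
      print(piece)

if __name__ == "__main__":
   # e.g. scan_dump("my_trace_file.trc") where the file contains the block dumps;
   scan_dump(sys.argv[1])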

 

[Blog] Composite Lazy Loading in SOA 12c

Online Apps DBA - Fri, 2019-01-18 05:45

Composite Lazy Loading is one of the new features in SOA Suite 12c which allows the components, WSDLs, and XSDs for a composite to be loaded as… Want to Know More About Composite Lazy Loading and how it helps? Visit: https://k21academy.com/soadev25 and consider our very new blog on composite lazy loading covering: How Composite Lazy […]

The post [Blog] Composite Lazy Loading in SOA 12c appeared first on Oracle Trainings for Apps & Fusion DBA.

Categories: APPS Blogs

[Blog] Oracle Critical Patch Update (CPU) January 2019 Now Available

Online Apps DBA - Fri, 2019-01-18 05:38

Oracle has released the Critical Patch Update (CPU) for January 2019 on 15th Jan 2019 with wide-ranging security updates. Visit: https://k21academy.com/appsdba45 and consider our blog about the new updates in the Critical Patch, covering product-wise fixes of the Jan 15 CPU for the following: ✔ Oracle E-Business Suite ✔ Oracle Fusion Middleware (Weblogic, SOA, IDM etc) ✔ Oracle […]

The post [Blog] Oracle Critical Patch Update (CPU) January 2019 Now Available appeared first on Oracle Trainings for Apps & Fusion DBA.

Categories: APPS Blogs

Two techniques for cloning a repository filestore, part II

Yann Neuhaus - Fri, 2019-01-18 03:27

This is part II of a two-part article. In part I, we introduced the Documentum repository file structure and saw a first way to transfer content files from one filesystem to another, using the repository to get their full path on disk. We saw how to generate an easy-to-parallelize set of rsync commands to completely or partially copy a repository’s content files. Here, we’ll show another way to do that which does not involve querying the repository.
Before this, however, we need test data, i.e. sub-directories with files on disk. The good thing is that those don’t have to be linked to documents in the repository; they can just be loose files, even empty ones, scattered over Documentum-style sub-trees, and they are easy to produce.

Creating a test sub-tree

The usual test repository is an out-of-the-box, practically empty one; therefore we need to create a dense Documentum-like sub-tree with lots of directories (up to 128 x 256 x 256 = 8192K of them per filestore, but we won’t go so far) and a few files, say 1 per subdirectory. There is no need to actually create related documents in a repository since we want to perform the copy of the files from outside the repository. As said above, the number of files is not really important as only their parent directory (and, indirectly, its content) is being passed to rsync; a one-file terminal sub-directory is enough to trigger its transfer. Their size does not matter either (we don’t want to measure the transfer speed, just to prepare performant rsync statements), so 0-byte files will do perfectly.
The short python program below creates such a Documentum-style sub-tree.

#! /usr/bin/python3
import datetime
import time
import sys
import os
import random
import multiprocessing
import multiprocessing.pool
import functools
import traceback
import getopt

# creates a Documentum tree at a given filesystem location using multi-processes for speed;
# Usage:
#   ./make-tree.py root_point
# Example:
#   ./make-tree.py "/home/dmadmin/documentum/data/dmtest/content_storage01/0000c35c/80"

# prerequisite to prevent the error OSError: [Errno 24] Too many open files if too many processes are forked;
# ulimit -n 10000

# 12/2018, C. Cervini, dbi-services;

def Usage():
   print("""Generates a dense Documentum-style sub-tree (i.e. a 3-level subtree with files only in the last one) with 255 x 256 sub-directories and 256 x 256 x 100 empty files under a given sub-directory;
Randomization of number of subdirectories and number of files at each level is possible within given limits;
This is useful to test different tree walking algorithms.
Created directories and files under the root directory have hexadecimal names, i.e. 00, 01, ... ff.
Usage:
   make-tree.py [-h|--help] | [-d|--dry-run] | [-r|--root-dir <root_directory>]
A dry-run will just print to stdout the command for the nodes to be created instead of actually creating them, which is useful to process them later by some more efficient parallelizing tool;
Example:
   make-tree.py --root-dir "dmtest/content_storage01/0000c35c/80"
If a forest is needed, just invoke this program in a shell loop, e.g.:
   for i in 80 81 82 83 84 85; do
      make-tree.py --root-dir "dmtest/content_storage01/0000c35c/${i}" &
   done
This command will start 6 O/S python processes each creating a subtree named 80 .. 85 under the root directory.
The created 6-tree forest will have the following layout:
dmtest/content_storage01/0000c35c                                                d
   /80                                                                           d
      /00                                                                        d
         /00                                                                     d
            /00                                                                  f
            /01                                                                  f
            ...                                                                ...
            /63                                                                  f
         /01
            /00
            /01
            ...
            /63
         ...
         /ff
            /00
            /01
            ...
            /63
      /01
         /00
            /00
            /01
            ...
            /63
         /01
            /00
            /01
            ...
            /63
         ...
         /ff
            /00
            /01
            ...
            /63
      ...
      /ff
         /00
            /00
            /01
            ...
            /63
         /01
            /00
            /01
            ...
            /63
         ...
         /ff
            /00
            /01
            ...
            /63
   /81
   ...
   /82
   ...
   /83
   ...
   /84
   ...
   /85
   ...
It will contain 6 x 256 x 256 directories and 6 x 256 x 256 x 1 files, unless randomization is requested, see the gp.* parameters;
   """)

# this function cannot be local to make_tree(), though it is only used there, because of the following error:
#     Can't pickle local object 'make_tree..make_files'
# the error prevents it from being invoked as a callback;
# actually, functions used in processes must have been entirely defined prior to their usage; this implies that a function cannot fork processes that call the function itself;
# therefore, the master function make_tree is needed that detaches the processes that execute the functions defined earlier, make_files and make_level;
# moreover, processes in a pool are daemonic and are not allowed to fork other processes;
# we must therefore use a global pool of processes allocated in the master function;
def make_files(dir):
   if gp.bRandomFiles:
      nbFiles = random.randint(gp.minFiles, gp.maxFiles)
   else:
      nbFiles = gp.maxFiles
   for nf in range(nbFiles):
      fullPath = dir + "/" + (gp.fileFormat % nf)
      if gp.bTest:
         print(f"touch {fullPath}")
      else:
         try:
            fd = os.open(fullPath, os.O_CREAT); os.close(fd)
         except:
           traceback.print_exc()
           print("ignoring ...")
   return nbFiles

# ditto;
def make_level(dir):
   """
   create a directory level under dir;
   """
   global gp

   if gp.bRandomSubDirs:
      nbSubDirs = random.randint(gp.minSubDirs, gp.maxSubDirs)
   else:
      nbSubDirs = gp.maxSubDirs
   level_dirs = []
   for nd in range(nbSubDirs):
      subDir = dir + "/" + (gp.dirFormat % nd)
      if gp.bTest:
         print("mkdir", subDir)
      else:
         try:
            os.mkdir(subDir)
         except:
            traceback.print_exc()
            print("ignoring ...")
      level_dirs.append(subDir)
   return level_dirs

# the master function;
# it creates 2 levels of directories and then empty files under the deep-most directories;
def make_tree(root):
   global gp, stats
   sub_level_dirs = []

   # list_dirs contains a list of the values returned by parallelized calls to function make_level, i.e. a list of lists;
   # get_dirs is called as a callback by map_async at job completion;
   def get_dirs(list_dirs):
      global stats
      nonlocal sub_level_dirs
      for l in list_dirs:
         stats.nb_created_dirs += len(l)
         sub_level_dirs.extend(l)

   # nb_files contains a list of the values returned by parallelized calls to function make_files, i.e. a list of numbers;
   # get_nb_files is called as a callback by map_async at job completion;
   def get_nb_files(nb_files):
      global stats
      stats.nb_dirs_at_bottom += len(nb_files)
      stats.nb_created_files += functools.reduce(lambda x, y: x + y, nb_files)
      stats.nb_created_dirs_with_files += functools.reduce(lambda x, y: x + y, [1 if x > 0 else 0 for x in nb_files])

   # callback for error reporting;
   def print_error(error):
      print(error)

   # first directory level;
   level_dirs = make_level(root)
   stats.nb_created_dirs += len(level_dirs)

   # 2nd directory level;
   sub_level_dirs = []
   gp.pool = multiprocessing.pool.Pool(processes = gp.maxWorkers)
   gp.pool.map_async(make_level, level_dirs, len(level_dirs), callback = get_dirs, error_callback = print_error)
   gp.pool.close()
   gp.pool.join()

   # make dummy files at the bottom-most directory level;
   gp.pool = multiprocessing.pool.Pool(processes = gp.maxWorkers)
   gp.pool.map_async(make_files, sub_level_dirs, len(sub_level_dirs), callback = get_nb_files, error_callback = print_error)
   gp.pool.close()
   gp.pool.join()

# -----------------------------------------------------------------------------------------------------------
if __name__ == "__main__":
# main;

   # global parameters;
   # we use a typical idiom to create a cheap namespace;
   class gp: pass
   
   # dry run or real thing;
   gp.bTest = 0
   
   # root directory;
   gp.root = None
   try:
       (opts, args) = getopt.getopt(sys.argv[1:], "hdr:", ["help", "dry-run", "root-dir="])
   except getopt.GetoptError:
      print("Illegal option")
      print("./make-tree.py -h|--help | [-d|--dry-run] | [-r|--root-dir ]")
      sys.exit(1)
   for opt, arg in opts:
      if opt in ("-h", "--help"):
         Usage()
         sys.exit()
      elif opt in ("-d", "--dry-run"):
         gp.bTest = 1
      elif opt in ("-r", "--root-dir"):
         gp.root = arg
   if None == gp.root:
      print("Error: root_dir must be specified")
      print("Use -h or --help for help")
      sys.exit()

   # settings for Documentum;
   # nb sub-directories below the "80";
   gp.maxLevels = 2
   
   # nb tree depth levels;
   gp.maxDepth = gp.maxLevels + 1
   
   # random nb sub-directories in each level;
   # maximum is 256 for Documentum;
   gp.bRandomSubDirs = 0
   gp.minSubDirs = 200
   gp.maxSubDirs = 256
   
   # random nb files in each level;
   # maximum is 256 for Documentum;
   gp.bRandomFiles = 0
   gp.minFiles = 0
   gp.maxFiles = 1
   
   # node names' format;
   gp.dirFormat = "%02x"
   gp.fileFormat = "%02x"
   
   # maximum number of allowed processes in the pool; tasks will wait until some processes are available;
   # caution: do not choose huge values, or disk I/Os will be saturated and overall performance will drop to a crawl;
   gp.maxWorkers = 40
   
   # again but for the counters;
   class stats: pass
   stats.nb_dirs_at_bottom = 0
   stats.nb_created_files = 0
   stats.nb_created_dirs = 0
   stats.nb_created_dirs_with_files = 0

   # initialize random number generator;
   random.seed()
         
   startDate = datetime.datetime.now()
   print(f"started building a {gp.maxDepth}-level deep tree starting at {gp.root} at {startDate}")
   if not gp.bTest:
      try:
         os.makedirs(gp.root)
      except:
         pass
   make_tree(gp.root)
   print("{:d} created files".format(stats.nb_created_files))
   print("{:d} created dirs".format(stats.nb_created_dirs))
   print("{:d} created dirs at bottom".format(stats.nb_dirs_at_bottom))
   print("{:d} total created dirs with files".format(stats.nb_created_dirs_with_files))
   print("{:d} total created nodes".format(stats.nb_created_dirs + stats.nb_created_files))

   endDate = datetime.datetime.now()
   print(f"Ended building subtree {gp.root} at {endDate}")
   print(f"subtree {gp.root} built in {(endDate - startDate).seconds} seconds, or {time.strftime('%H:%M:%S', time.gmtime((endDate - startDate).seconds))} seconds")

No special attention has been given to the user interface and most settings are hard-coded in the script; their values are currently set to produce a dense (256 x 256 = 64K sub-directories per filestore) Documentum-style directory sub-tree with just 1 zero-byte file in each terminal sub-directory. It is pointless, but possible, to have more files here since we only want to generate a set of rsync commands, one for each non-empty subdirectory.
Examples of execution:
Create a dense 64K sub-tree at relative path clone/dmtest/content_storage01/0000c35a/86, with one empty leaf file in each terminal directory:

./make-tree-dctm4.py --root-dir clone/dmtest/content_storage01/0000c35a/86
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/86 at 2018-12-31 18:47:20.066665
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/86 at 2018-12-31 18:47:34.612075
subtree clone/dmtest/content_storage01/0000c35a/86 built in 14 seconds, or 00:00:14 seconds

Sequentially create a 6-tree forest starting at relative path clone/dmtest/content_storage01/0000c35a:

for i in 80 81 82 83 84 85; do
./make-tree-dctm4.py --root-dir clone/dmtest/content_storage01/0000c35a/${i}
done
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/80 at 2018-12-31 18:41:31.933329
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/80 at 2018-12-31 18:41:39.694346
subtree clone/dmtest/content_storage01/0000c35a/80 built in 7 seconds, or 00:00:07 seconds
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/81 at 2018-12-31 18:41:39.738200
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/81 at 2018-12-31 18:42:14.166057
subtree clone/dmtest/content_storage01/0000c35a/81 built in 34 seconds, or 00:00:34 seconds
...
subtree clone/dmtest/content_storage01/0000c35a/84 built in 22 seconds, or 00:00:22 seconds
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/85 at 2018-12-31 18:43:06.644111
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/85 at 2018-12-31 18:43:41.527459
subtree clone/dmtest/content_storage01/0000c35a/85 built in 34 seconds, or 00:00:34 seconds

So, the test creation script is quite quick with the current 40 concurrent workers. Depending on your hardware, be careful with that value because the target disk can easily be saturated and stop responding, especially if each tree of a forest is created concurrently.
Just out of curiosity, how long would it take ls to navigate this newly created forest?

# clear the disk cache first;
sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
 
time ls -1R clone/dmtest/content_storage01/0000c35a/* | wc -l
1840397
real 2m27,959s
user 0m4,291s
sys 0m19,587s
 
# again without clearing the disk cache;
time ls -1R clone/dmtest/content_storage01/0000c35a/* | wc -l
1840397
real 0m2,791s
user 0m0,941s
sys 0m1,950s

ls is very fast because there is just one leaf file in each sub-tree; it would be a whole different story with well-filled terminal sub-directories. Also, the cache has a tremendous speed-up effect, and the timed tests below will always take care to clear it before a new run. The command above is very effective for this purpose.

Walking the trees

As written above, the main task before generating the rsync commands is to reach the terminal sub-directories. Let’s first see if the obvious “find” command could be used, e.g.:

loc_path=/home/documentum/data/dmtest/./content_storage_01
find ${loc_path} -type d | gawk -v FS='/' -v max_level=11 '{if (max_level == NF) print}'
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01
...
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b
...

find stopped at the last subdirectory, just before the files, as requested.
Can we get rid of the extra process used by the gawk filter? Let’s try this:

find ${loc_path} -mindepth 4 -maxdepth 4 -type d
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01
...
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b
...

We can, good. The -mindepth and -maxdepth parameters, and the -type filter, let us jump directly to the directory level of interest, which is exactly what we want.

The “find” command is very fast; e.g. on the test forest above:

time find ${loc_path} -mindepth 4 -maxdepth 4 -type d | wc -l
458752
 
real 0m10,423s
user 0m0,536s
sys 0m2,450s

10 seconds for the forest’s 458’000 terminal directories, with the disk cache emptied beforehand, which is impressive. If those directories were completely filled, they would contain about 117 million files, a relatively beefy repository. Thus, find is a valuable tool, also because it can be directed to stop before reaching the files. Finally, Documentum’s unusual file layout does not look so weird now, does it? Let’s therefore use find to generate the rsync commands on the terminal sub-directories it returns:

find $loc_path -mindepth 4 -maxdepth 4 -type d | xargs -i echo "rsync -avzR {} dmadmin@${dest_machine}:{}" > migr

Example of execution:

find $loc_path -mindepth 4 -maxdepth 4 -type d | xargs -i echo "rsync -avzR {} dmadmin@new_host:/some/other/dir/dmtest"
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/ea dmadmin@new_host:/some/other/dir/dmtest
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/5e dmadmin@new_host:/some/other/dir/dmtest
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/aa dmadmin@new_host:/some/other/dir/dmtest
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/cc dmadmin@new_host:/some/other/dir/dmtest
...

Subsequently, the commands in migr could be executed in parallel N at a time, with N a reasonable number that won’t hog the infrastructure.
The same gawk script shown in part I could be used here:

find $loc_path -mindepth 4 -maxdepth 4 -type d | gawk -v nb_rsync=10 -v dest=dmadmin@new_machine:/some/other/place/dmtest 'BEGIN {
print "\#!/bin/bash"
}
{
printf("rsync -avzR %s %s &\n", $0, dest)
if (0 == ++nb_dirs % nb_rsync)
print "wait"
}
END {
if (0 != nb_dirs % nb_rsync)
print "wait"
}' > migr.sh

Output:

rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/ea dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/5e dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/aa dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/cc dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/e5 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/bd dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/1d dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/61 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/39 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/75 dmadmin@new_machine:/some/other/place/dmtest &
wait
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/6d dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/d2 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/8c dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/a1 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/84 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/b8 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/a4 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/27 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/fe dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/e6 dmadmin@new_machine:/some/other/place/dmtest &
wait
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/4f dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/82 dmadmin@new_machine:/some/other/place/dmtest &
...
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/86/95/92 dmadmin@new_machine:/some/other/place/dmtest &
wait
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/86/95/20 dmadmin@new_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/86/95/95 dmadmin@new_machine:/some/other/place/dmtest &
wait
 
chmod +x migr.sh
nohup ./migr.sh &
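As an alternative to the wait-batched migr.sh, a small Python wrapper along the following lines could feed the raw migr file (one rsync command per line, as generated by the find | xargs command earlier) to a pool of workers, in the same spirit as the multiprocessing scripts used elsewhere in this article. It is only a sketch; the file name and the pool size are assumptions to adapt to your environment.

#! /usr/bin/python3
import subprocess
import multiprocessing.pool

# run the rsync commands listed in the migr file (one complete command per line),
# at most maxWorkers of them at any one time;
maxWorkers = 10

def run_command(cmd):
   # shell=True because each line is a full rsync command string;
   return cmd, subprocess.call(cmd, shell = True)

if __name__ == "__main__":
   with open("migr") as f:
      commands = [line.strip() for line in f if line.strip()]
   with multiprocessing.pool.Pool(processes = maxWorkers) as pool:
      for cmd, rc in pool.imap_unordered(run_command, commands):
         if 0 != rc:
            print(f"non-zero return code {rc} for: {cmd}")

A thread pool would work just as well here, since the heavy lifting is done by the spawned rsync processes rather than by the Python workers themselves.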

The good thing with rsync in archive mode is that the commands can be interrupted at any time; once relaunched, they will quickly resume the transfers where they left off since they work incrementally.
Here, there are no SQL statements to correct the dm_locations if needed, because we don’t connect to the repository to get the necessary information. It is not a big deal to produce them in case of need, however; just don’t forget.

Feeling uncomfortable because of the many rsync commands produced? As in part I, just tell find to stop one level above the terminal sub-directories by using “-mindepth 3 -maxdepth 3”, at the cost of some decrease in speed though.

Any faster find alternative?

As fast as find is when walking a docbase’s directories with lots of filestores, it can still take a long time despite the depth and type limits. As we asked before for rsync, is there any way to speed it up by running several find commands concurrently, each on its own directory partition? Such a find could be launched on sub-directory /home/dmadmin/data/dmtest/filestore_1, and others on /home/dmadmin/data/dmtest/filestore_i, all simultaneously. E.g.:

slice=0; for fs in dmtest/./*; do
(( slice += 1 ))
(
find $fs -mindepth 4 -maxdepth 4 -type d | gawk -v nb_rsync=10 -v dest=dmadmin@new_machine:/some/other/place/dmtest 'BEGIN {
print "\#!/bin/bash"
}
{
printf("rsync -avzR %s %s &\n", $0, dest)
if (0 == ++nb_dirs % nb_rsync)
print "wait"
}
END {
if (0 != nb_dirs % nb_rsync)
print "wait"
}' > migr${slice}.sh
chmod +x migr${slice}.sh
./migr${slice}.sh
) &
done

But with such an approach, the degree of parallelism is limited by the number of sub-directories to explore (i.e. here, the number of filestores in the repository). What about the script below instead?

#! /usr/bin/python3
import datetime
import time
import sys
import os
import multiprocessing
import multiprocessing.pool
import getopt

# walks a Documentum tree at a given filesystem location using multi-processes for speed;
# C. Cervini, dbi-services.com, 12/2018;

def Usage():
   print("""Walks a sub-tree and prints the files' full path;
Usage:
walk-tree.py [-h|--help] | [-p|--print] [-r|--root-dir <root_directory>] [-v|--verbose] [-m|--max_level <level>]
-p|--print for outputting the files found; the default action is to do nothing, which is handy for just getting a count of objects;
-r|--root-dir is the starting directory to explore;
-v|--verbose for outputting timing and count information; off by default;
-m|--max_level stops recursion at the max_level-th tree level, not included; like find's -maxdepth parameter;
Example:
walk-tree.py --print --root-dir ./dmtest/content_storage01/0000c35d/80
walk-tree.py --verbose --root-dir ./dmtest/content_storage01/0000c35d/80
""")

# this function cannot be local to do_tree(), though it is only used there, because of the following error:
#     Can't pickle local object 'do_tree..do_level'
# the error prevents it from being invoked as a local callback;
# actually, functions used in processes must have been entirely defined prior to their usage; this implies that a function cannot fork processes that call the function itself;
# therefore, the master function do_tree is needed that detaches the processes that execute the functions defined earlier, do_level here;
# moreover, processes in a pool are daemonic and are not allowed to fork other processes;
# we must therefore use a global pool of processes allocated in the master function;
def do_level(dir):
   """
   lists the sub-directories and files under dir;
   """
   global gp
   result = {'nb_files': 0, 'nb_dirs': 0, 'level_dirs': []}
   # stop recursing once dir is more than max_level path components deeper than the root directory;
   if gp.max_level > 0 and len(dir.split(os.sep)) - gp.root_level > gp.max_level:
      #print("too deep, returning")
      if gp.bPrint:
         print(dir)
      return result
   for node in os.listdir(dir):
      fullpath = os.path.join(dir, node)
      if os.path.isdir(fullpath):
         result['nb_dirs'] += 1
         result['level_dirs'].append(fullpath)
      elif os.path.isfile(fullpath):
         if gp.bPrint:
            print(fullpath)
         result['nb_files'] += 1
   return result

# the master function;
def do_tree(root):
   global gp, stats
   sub_level_dirs = []

   # list_dirs contains a list of the values returned by parallelized calls to function do_level, i.e. a list of dictionaries;
   # get_dirs is invoked as a callback by map_async once all the processes in the pool have terminated executing do_level() and returned their result;
   def get_dirs(list_dirs):
      global stats
      nonlocal sub_level_dirs
      for l in list_dirs:
         stats.nb_dirs += len(l['level_dirs'])
         stats.nb_files += l['nb_files']
         stats.nb_dirs_with_files += (1 if l['nb_files'] > 0 else 0)
         sub_level_dirs.extend(l['level_dirs'])

   # callback for error reporting;
   def print_error(error):
      print(error)

   # first directory level;
   level_data = do_level(root)
   stats.nb_files += level_data['nb_files']
   stats.nb_dirs += len(level_data['level_dirs'])
   stats.nb_dirs_with_files += (1 if level_data['nb_files'] > 0 else 0)

   # all the other directory sub-levels;
   while level_data['nb_dirs'] > 0:
      sub_level_dirs = []
      gp.pool = multiprocessing.pool.Pool(processes = gp.maxWorkers)
      gp.pool.map_async(do_level, level_data['level_dirs'], len(level_data['level_dirs']), callback = get_dirs, error_callback = print_error)
      gp.pool.close()
      gp.pool.join()
      level_data['nb_files'] = 0
      level_data['nb_dirs'] = len(sub_level_dirs) 
      level_data['level_dirs'] = sub_level_dirs

# -----------------------------------------------------------------------------------------------------------
if __name__ == "__main__":
   # main;
   # global parameters;
   # we use a typical idiom to create a cheap but effective namespace for the global variables used as execution parameters;
   class gp: pass

   # root directory;
   gp.root = None
   gp.bPrint = False
   gp.bVerbose = False
   gp.root_level = 0
   gp.max_level = 0
   try:
       (opts, args) = getopt.getopt(sys.argv[1:], "hpr:m:v", ["help", "print", "root-dir=", "verbose", "max_level="])
   except getopt.GetoptError:
      print("Illegal option")
      print("./walk-tree.py -h|--help | [-p | --print] [-r|--root-dir ] [-v|--verbose] [-m|--max_level]")
      sys.exit(1)
   for opt, arg in opts:
      if opt in ("-h", "--help"):
         Usage()
         sys.exit()
      elif opt in ("-p", "--print"):
         gp.bPrint = True
      elif opt in ("-r", "--root-dir"):
         gp.root = arg
      elif opt in ("-v", "--verbose"):
         gp.bVerbose = True
      elif opt in ("-m", "--max_level"):
         try:
            gp.max_level = int(arg)
         except:
            print("invalid value for max_level")
            sys.exit()
   if None == gp.root:
      print("Error: root_dir must be specified")
      print("Use -h or --help for help")
      sys.exit()
   # depth of the root directory in path components, used as the reference point for --max_level;
   gp.root_level = len(gp.root.split(os.sep))

   # maximum number of allowed processes; tasks will wait until some processes are available;
   # caution: do not choose huge values, or disk I/Os will be saturated and overall performance will drop to a crawl;
   gp.maxWorkers = 50

   # again but for the counters;
   class stats: pass
   stats.nb_files = 0
   stats.nb_dirs = 0
   stats.nb_dirs_with_files = 0

   startDate = datetime.datetime.now()
   if gp.bVerbose:
      print(f"started walking {gp.root} at {startDate}")
   do_tree(gp.root)
   endDate = datetime.datetime.now()
   if gp.bVerbose:
      print("{:d} found files".format(stats.nb_files))
      print("{:d} found dirs".format(stats.nb_dirs))
      print("{:d} total found nodes".format(stats.nb_dirs + stats.nb_files))
      print("{:d} total found dirs with files".format(stats.nb_dirs_with_files))
      print(f"Ended walking subtree {gp.root} at {endDate}")
      print(f"subtree {gp.root} walked in {(endDate - startDate).seconds} seconds, or {time.strftime('%H:%M:%S', time.gmtime((endDate - startDate).seconds))} seconds")

Here, we have a pool of workers which receives tasks to explore sub-directories up to a given depth, just like find’s maxdepth. But is it any faster than a sequential find? On my test laptop with an external USB 3.1 spinning drive, find gives the best result:

sudo sysctl vm.drop_caches=3; time find dmtest/content_storage01 -mindepth 4 -maxdepth 4 -type d | wc -l
vm.drop_caches = 3
1787628
 
real 35m19,989s
user 0m9,316s
sys 0m43,632s

The python script is very close but lags 2 minutes behind with 35 workers:

sudo sysctl vm.drop_caches=3; time ./walk-tree4.py -v --root-dir dmtest/content_storage01 -m 3
vm.drop_caches = 3
started walking dmtest/content_storage01 at 2019-01-01 17:47:39.664421
0 found files
1797049 found dirs
1797049 total found nodes
0 total found dirs with files
Ended walking subtree dmtest/content_storage01 at 2019-01-01 18:25:18.437296
subtree dmtest/content_storage01 walked in 2258 seconds, or 00:37:38 seconds
real 37m38,996s
user 0m34,138s
sys 1m11,747s

Performance gets worse when the number of concurrent workers is increased, likely because the disk is saturated; it was not designed for such intensive use in the first place. Whether with find or the python script, most of the execution time is spent waiting for the I/Os to complete, with very little time spent in user or system code. Obviously, there is little to optimize here, except switching to a faster disk sub-system, e.g. beginning with an SSD drive. Even a parallel equivalent of find would not bring any improvement; it would only put more pressure on the disk and be counter-productive. But on real production infrastructures, with large enough disk bandwidth, it may be worth parallelizing if the volume is large, and that’s where the script can make a difference.

So, which one is better?

Both alternatives generate more or less the same rsync commands to execute later; the only difference is the source of the information used to produce those commands: the repository in the first alternative and the filesystem in the second.
The first alternative looks simpler and cleaner because it works from a repository and gets its information directly from it. But if one wants to be absolutely, positively sure not to forget any content file, the second alternative is better as it works directly from the filesystem; since no queries are run for each of the contents, a lot of time is saved, even despite the required disk walking. It is ideal when an exact clone is needed, orphans and possibly garbage included. Its python script variant is interesting in that it can readily take advantage of faster I/Os through an easy-to-set concurrency level.

The article Two techniques for cloning a repository filestore, part II appeared first on Blog dbi services.

Two techniques for cloning a repository filestore, part I

Yann Neuhaus - Fri, 2019-01-18 03:26

I must confess that my initial thought for the title was “An optimal repository filestore copy”. Optimal, really? Relative to what? Which variable(s) define(s) the optimality? Speed or time to clone? Too dependent on the installed hardware and software, and on the available resources and execution constraints. Simplicity of doing it? Too simple a method can result in a very long execution time, while complexity can give a faster solution but be fragile, and vice-versa. Besides, simplicity is a relative concept; a solution may look simple to someone and cause nightmares to others. Beauty? I like that one, but no, too fuzzy too. Finally, I settled for the present title because it is neutral and to the point. I leave it up to the reader to judge whether the techniques are optimal, simple or beautiful. I only hope that they can be useful to someone.
This article has two parts. In each, I’ll give an alternative for copying a repository’s filestores from one filesystem to another. Actually, both techniques are very similar; they just differ in the way the content files’ path locations are determined. But let’s start.

A few simple alternatives

Cloning a Documentum repository is a well-known procedure nowadays. Globally, it involves first creating a placeholder docbase for the clone and then copying the meta-data stored in the source database, e.g. through an export/import, usually while the docbase is stopped, plus the document contents, generally stored on disks. If a special storage peripheral is used, such as a Centera CAS or a NAS, there might be a fast, low-level way to clone the content files directly at the device level; check with the manufacturer.
If all we want is an identical copy of the whole contents’ filesystem, the command dd could be used, e.g. supposing that the source docbase and the clone docbase both use for their contents a dedicated filesystem mounted on /dev/mapper/vol01, respectively /dev/mapper/vol02:

sudo dd if=/dev/mapper/vol01 bs=1024K of=/dev/mapper/vol02

The clone’s filesystem /dev/mapper/vol02 could be mounted temporarily on the source docbase’s machine for the copy and later dismounted. If this is not possible, dd can be used over the network, e.g.:

# from the clone machine;
ssh root@source 'dd if=/dev/mapper/vol01 bs=1024K | gzip -1 -' | zcat | dd of=/dev/mapper/vol02
 
# from the source machine as root;
dd if=/dev/mapper/vol01 bs=1024K | gzip -1 - | ssh dmadmin@clone 'zcat | dd of=/dev/mapper/vol02'

dd, and other partition imaging utilities, perform a block-by-block mirror copy of the source, which is much faster than working at the file level, although it depends on the percentage of used space (if the source filesystem is almost empty, dd will spend most of its time copying unused blocks, which is useless; here, a simple file-by-file copy would be more effective). If it is not possible to work at this low level, e.g. the filesystems are not dedicated to the repositories’ contents or the types of the filesystems differ, then a file-by-file copy is required. Modern disks are quite fast, especially for deep-pocket companies using SSDs, and so file copy operations should be acceptably quick. A naive command such as the one below could even be used to copy the whole repository’s content (we suppose that the clone will be on the same machine):

my_docbase_root=/data/Documentum/my_docbase
dest_path=/some/other/or/identical/path/my_docbase
cp -rp ${my_docbase_root} ${dest_path}/../.

If $dest_path differs from $my_docbase_root, don’t forget to edit the dm_filestore’s dm_locations of the clone docbase accordingly.

If confidentiality is requested and/or the copy occurs across a network, scp is recommended as it also encrypts the transferred data:

scp -rp ${my_docbase_root} dmadmin@${dest_machine}:${dest_path}/../.

The venerable tar command could also be used, on-the-fly and without creating an intermediate archive file, e.g.:

( cd ${my_docbase_root}; tar cvzf - * ) | ssh dmadmin@${dest_machine} "(cd ${dest_path}/; tar xvzf - )"

Even better, the command rsync could be used as it is much more versatile and efficient (and still secure too if configured to use ssh, which is the default), especially if the copy is done live several times in advance during the days preceding the kick-off of the new docbase; such copies will be performed incrementally and will execute quickly, providing an easy way to synchronize the copy with the source. Example of use:

rsync -avz --stats ${my_docbase_root}/ dmadmin@${dest_machine}:${dest_path}

The trailing / in the source path means copy the content of ${my_docbase_root} but not the directory itself, as we assumed it already exists.
Alternatively, we can include the directory too:

rsync -avz --stats ${my_docbase_root} dmadmin@${dest_machine}:${dest_path}/../.

If we run it from the destination machine, the command changes to:

rsync -avz --stats dmadmin@source_machine:${my_docbase_root}/ ${dest_path}/.

The --stats option is handy to obtain a summary of the transferred files and the resulting performance.

Still, if the docbase is large and contains millions to hundreds of millions of documents, copying them to another machine can take some time. If rsync is used, repeated executions will just copy over the modified or new documents, and optionally remove the deleted ones (with the --delete option), but the first run will take time anyway. Logically, reasonably taking advantage of the available I/O bandwidth by having several rsync processes running at once should reduce the time to clone the filestore, shouldn’t it? Is it possible to apply a divide-and-conquer technique here and process each part simultaneously? It is, and here is how.

How Documentum stores the contents on disk

Besides the sheer volume of documents and possibly the limited network and disk I/O bandwidth, one reason the copy can take a long time, independently of the tools used, is the peculiar way Documentum stores its contents on disk, with all the content files located exclusively at the bottom of a 6-level deep sub-tree with the following structure (assuming $my_docbase_root has the same value as above; the letter in column 1 means d for directory and f for file):

cd $my_docbase_root
d  filestore_1
d     <docbase_id>, e.g. 0000c350
d        80                   starts with 80 and increases by 1 up to ff, for a total of 2^7 = 128 sub-trees;
d           00                up to 16 x 16 = 256 directories directly below 80, from 00 to ff
d              00             again, up to 256 directories directly below, from 00 to ff, up to 64K total subdirectories at this level; let's call these innermost directories "terminal directories";
f                 00[.ext]    files are here, up to 256 files, from 00 to ff, per terminal directory
f                 01[.ext]
f                 ...
f                 ff[.ext]
d              01
f                 00[.ext]
f                 01[.ext]
f                 ...
f                 ff[.ext]
d              ...
d              ff
f                 00[.ext]
f                 01[.ext]
f                 ...
f                 ff[.ext]
d           01
d           ...
d           ff
d        81
d           00
d           01
d           ...
d           ff
d        ...
d        ff
d           00
d           01
d           ...
d           ff
f                 00[.ext]
f                 01[.ext]
f                 ...
f                 ff[.ext]

The content files on disk may have an optional extension and are located exclusively at the extremities of their sub-tree; there are no files in the intermediate sub-directories. In other words, the files are the leaves of a filestore’s directory tree.
All the nodes have a 2-digit lowercase hexadecimal name, from 00 to ff, possibly with holes in the sequences when sub-directories or deleted documents have been swept by the DMClean job. With such a layout, each filestore can store up to (2^7).(2^8).(2^8).(2^8) files, i.e. 2^31 files or a bit more than 2.1 billion files. Any number of filestores can be used for a virtually “unlimited” number of content files. However, since each content object has a 16-digit hexadecimal id of which only the last 8 hexadecimal digits are really distinctive (and directly map to the filesystem path of the content file, see below), a docbase can effectively contain “only” up to 16^8 files, i.e. 2^32 content files or slightly more than 4.2 billion files distributed among all the filestores. Too few? There is hope, aside from spawning new docbases. The knowledge base note here explains how “galactic” r_object_ids are allocated if more than 4 billion documents are present in a repository, so it should be possible to have literally gazillions of documents in a docbase. It is not clear though whether this galactic concept is implemented yet or whether it has ever been triggered once, so let us stay with our feet firmly on planet Earth for the time being.
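Those capacity figures are easy to double-check with a couple of lines of Python, for instance:

# quick sanity check of the capacity figures quoted above;
files_per_filestore = 128 * 256 * 256 * 256                  # 80..ff x 00..ff x 00..ff x 00..ff
print(files_per_filestore, files_per_filestore == 2**31)     # 2147483648 True
docbase_capacity = 16**8                                     # only the last 8 hex digits of a content id are distinctive
print(docbase_capacity, docbase_capacity == 2**32)           # 4294967296 True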

It should be emphasized that such a particular layout in no way causes a performance penalty in accessing the documents’ contents from within the repository because their full path can easily be computed by the content server out of their dm_content.data_ticket attribute (a signed decimal number), e.g. as shown with the one-liner:

data_ticket=-2147440384
echo $data_ticket | gawk '{printf("%x\n", $0 + 4294967296)}'
8000a900

or, more efficiently entirely in the bash shell:

printf "%x\n" $(($data_ticket + 4294967296))
8000a900

This value is now split apart by groups of 2 hex digits with a slash as separator: 80/00/a9/00
To compute the full path, the filestore’s location and the docbase id still need to be prepended to the above partial path, e.g. ${my_docbase_root}/filestore_01/0000c350/80/00/a9/00. Thus, knowing the r_object_id of a document, we can find its content file on the filesystem as shown (or, preferably, by using the getpath API function), and knowing the full path of a content file makes it possible to find the document it belongs to (or documents, as the same content can be shared among several documents) back in the repository. To be complete, the explanation still needs the concept of filestore and its relationship with a location, but let’s stop digressing (check paragraph 3 below for a hint) and focus back on the subject at hand. We now have enough information to get us started.
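For illustration, the same computation can be wrapped in a few lines of Python; this is just a sketch of the mapping described above, with the filestore location and the docbase id passed in as parameters (the values below are the ones used in this example):

#! /usr/bin/python3
# map a dm_content.data_ticket (a signed 32-bit decimal) to its content file path,
# as described above: 8 hex digits split into pairs, prefixed with the filestore
# location and the docbase id (any file extension is ignored here);
def content_path(data_ticket, filestore_location, docbase_id):
   hex8 = "%08x" % (data_ticket + 4294967296 if data_ticket < 0 else data_ticket)
   return "/".join([filestore_location, docbase_id] + [hex8[i:i+2] for i in range(0, 8, 2)])

# the data_ticket used in the example above;
print(content_path(-2147440384, "${my_docbase_root}/filestore_01", "0000c350"))
# ${my_docbase_root}/filestore_01/0000c350/80/00/a9/00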

As there are no files in the intermediate levels, it is necessary to walk the entire tree to reach them and start their processing, which is very time-consuming. Depending on your hardware, a ‘ls -1R’ command can take hours to complete on a set of fairly dense sub-trees. A contrario, this is an advantage for processing through rsync because rsync is able to create all the necessary sub-path levels (aka “implied directories” in rsync lingo) if the -R|–relative option is provided, as if a “mkdir -p” were issued; thus, in order to optimally copy an entire filestore, it would be enough to rsync only the terminal directories, once identified, and the whole sub-tree would be recreated implicitly. In the illustration above, the rsync commands for those paths are:

cd ${my_docbase_root}
rsync -avzR --stats filestore_01/80/00/00 dmadmin@{dest_machine}:${dest_path}
rsync -avzR --stats filestore_01/80/00/01 dmadmin@{dest_machine}:${dest_path}
rsync -avzR --stats filestore_01/80/00/ff dmadmin@{dest_machine}:${dest_path}
rsync -avzR --stats filestore_01/ff/ff/00 dmadmin@{dest_machine}:${dest_path}

In rsync ≥ v2.6.7, it is even possible to restrict the part within the source full path that should be copied remotely, so no preliminary cd is necessary, e.g.:

rsync -avzR --stats ${my_docbase_root}/./filestore_01/80/00/00 dmadmin@{dest_machine}:${dest_path}

Note the /./ path component, it marks the start of the relative path to reproduce remotely. This command will create the directory ${dest_path}/filestore_01/80/00/00 on the remote host and copy its content there.
Path specification can be quite complicated, so use the -n|–dry-run and -v|–verbose (or even -vv for more details) options to have a peek at rsync’s actions before they are applied.

With the -R option, we get to transfer only the terminal sub-directories AND their original relative paths, efficiency and convenience !
We potentially replace millions of file-by-file copy commands with only up to 128 * 64K directory copy commands per filestore, which is much more concise and efficient.

However, if there are N content files to transfer, at least ⌈N/256⌉ such rsync commands will be necessary, e.g. a minimum of about 3’900 commands for 1 million files, subject of course to their distribution in the sub-tree (some less dense terminal sub-directories can contain fewer than 256 files, so more directories and hence more commands are required). It is not documented how the content server distributes the files over the sub-directories and there is no balancing job that relocates the content files in order to reduce the number of terminal directories by increasing the density of the remaining ones and removing the emptied ones. All this is quite sensible because, while it may matter to us, it is a non-issue to the repository.
Nevertheless, on the bright side, since a tree is by definition acyclic, rsync transfers won’t overlap and therefore can be parallelized without synchronization issues, even when the creation of intermediate “implied directories” is requested simultaneously by 2 or more rsync commands (if rsync performs an operation equivalent to “mkdir -p”, possible errors due to race conditions during concurrent execution can simply be ignored since the operation is idempotent).

Empty terminal paths can be skipped without fear because they are not referenced in the docbase (only content files are) and hence their absence from the copy cannot introduce inconsistencies.

Of course, in the example above, those hypothetical 3’900 rsync commands won’t be launched at once but in groups of some convenient size depending on the load that the machines can endure or the application’s response time degradation if the commands are executed during normal work hours. Since we are dealing with files and possibly the network, the biggest bottleneck will be the I/Os and care should be exercised not to saturate them. When dedicated hardware such as a high-speed networked NAS with SSD drives is used, this is less of a problem and more parallel rsync instances can be started, but such expensive resources are often shared across a company so that someone might still be impacted at one point. I for one remember one night when I was running one data-intensive DQL query in ten different repositories at once and users suddenly complained that they couldn’t work on their Word documents any more because response time had slowed to a crawl. How was that possible ? What was the relationship between DQL queries in several repositories and a desktop program ? Well, the repositories used Oracle databases whose datafiles were stored on a NAS also used as a host for networked drives mounted on desktop machines. Evidently, that configuration was somewhat sloppy but one never knows how things are configured at a large company, so be prepared for the worst and set up the transfers so that they can be easily suspended, interrupted and resumed. The good thing with rsync in archive mode is that the transfers can be resumed where they left off at a minimal cost just by relaunching the same commands, with no need to compute a restarting point.

It goes without saying that setting up public key authentication or ssh connection sharing is mandatory when rsync-ing to a remote machine in order to suppress the thousands of authentication requests that will pop up during the batch execution.

But up to 128 * 64K rsync commands per filestore : isn’t that a bit too much ?

rsync performs mostly I/O operations, reading and writing the filesystems, possibly across a network. The time spent in those slow operations (mostly waiting for them to complete, actually) by far outweighs the time spent launching the rsync processes and executing user or system code, especially if the terminal sub-directories are densely filled with large files. Moreover, if the DMClean job has been run ahead of time, this 128 * 64K figure is a maximum; it is only reached if none of the terminal sub-directories is empty.
Still, rsync has some cleverness of its own in processing the files to copy, so why not let it do its job ? Is it possible to reduce the number of commands ? Of course, by just stopping one level before the terminal sub-directories, at their parents’ level. From there, up to 128 * 256 rsync commands are necessary, down from 128 * 64K commands. rsync would then itself explore the up to 64K terminal directories below, hopefully more efficiently than when explicitly told to. For sparse sub-directories or small docbases, this could be more efficient. If so, what would be the cut-off point ? This is a complex question depending on so many factors that there is no other way to answer it than to experiment with several situations. A few informal tests show that copying terminal sub-directories with up to 64K rsync commands is about 40% faster than copying their parent sub-directories. If optimality is defined as speed, then the “64K” variant is the best; if it is defined as compactness, then the “256” variant is the best. One explanation for this could be that the finer and simpler the tasks to perform, the quicker they terminate and free up processes to be reused. Or maybe rsync is overwhelmed by the up to 64K sub-directories to explore and needs some help. The scripts in this article allow experimenting with the “256” variant.
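To put numbers on the variants just discussed, the maximum number of rsync commands per filestore for each starting level can be laid out in this trivial python sketch (the 0/-1/-2 keys correspond to the --level parameter of the script presented further below):

# maximum number of rsync commands per filestore for each starting level;
max_rsync_cmds = {
    0: 128 * 256 * 256,   # terminal directories, the "64K" variant: up to 8'388'608 commands
   -1: 128 * 256,         # their parent directories, the "256" variant: up to 32'768 commands
   -2: 128                # one more level up: up to 128 commands
}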

To summarize up to this point, we will address the cost of navigating the special Documentum sub-tree layout by walking the location sub-trees up to the last directory level (or up to 2 levels above, if so requested) and generating efficient rsync commands that can easily be parallelized. But before we start, how about asking the repository about its contents ? As it keeps track of them, wouldn’t this alternative be much easier and faster than navigating complex directory sub-trees ? Let’s see.

Alternate solution: just ask the repository !

Since a repository obviously knows where its content files are stored on disk, it makes sense to get this information directly from it. In order to be sure to include all the possible renditions as well, we should query dmr_content instead of dm_document(all) (note that the DQL function MFILE_URL() returns those too, so that a “select MFILE_URL('') from dm_document(all)” could also be used here). Also, unless the dmDMClean job is run beforehand, dmr_content includes orphan contents as well, so this point must be clarified beforehand. Anyway, by querying dmr_content we are sure not to omit any content, orphan or not.
The short python/bash-like pseudo-code shows how we could do it:

for each filestore in the repository:
(
   for each content in the filestore:
      get its path, discard its filename;
      add the path to the set of paths to transfer;
   for each path in the set of paths:
      generate an rsync -avzR --stat command;
) &

Line 4 just gets the terminal sub-directories, while line 5 ensures that they are unique in order to avoid rsync-ing the same path multiple times. We use sets here to guarantee distinct terminal path values (set elements are unique in the set).
Line 7 outputs the rsync commands for the terminal sub-directories and the destination.
Even though all filestores are processed concurrently, there could be millions of contents in each filestore and such queries could take forever. However, if we run several such queries in parallel, each working on its own partition (i.e. a non-overlapping subset of the whole such that their union is equal to the whole), we could considerably speed it up. Constraints such as “where r_object_id like ‘%0’”, “where r_object_id like ‘%1’”, “where r_object_id like ‘%2’”, .. “where r_object_id like ‘%f’” can slice up the whole set of documents into 16 more or less equally-sized subsets (since the r_object_id is essentially a sequence, its modulo 16 or 256 distribution is uniform), which can then be worked on independently and concurrently. Constraints like “where r_object_id like ‘%00’” .. “where r_object_id like ‘%ff’” can produce 256 slices, and so on.
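As a quick illustration, such slice predicates could be generated as follows (a sketch only; the storage_id value is a placeholder):

# the storage_id below is a placeholder; use the actual dm_filestore's r_object_id;
base = "select r_object_id from dmr_content where storage_id = '28...' and r_object_id like "
slices_16  = [base + "'%{:x}'".format(i) for i in range(16)]     # '%0' .. '%f', 16 slices
slices_256 = [base + "'%{:02x}'".format(i) for i in range(256)]  # '%00' .. '%ff', 256 slices
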
Here is a short python 3 program that does all this:

#!/usr/bin/env python

# 12/2018, C. Cervini, dbi-services;
 
import os
import sys
import traceback
import getopt
from datetime import datetime
import DctmAPI
import multiprocessing
import multiprocessing.pool

def Usage():
   print("""Purpose:
Connects as dmadmin/xxxx to a given repository and generates rsync commands to transfer the given filestores' contents to the given destination;
Usage:
   ./walk_docbase.py -h|--help | -r|--repository <docbase> [-f|--filestores <filestore>{,<filestore>}|all] [-d|--dest <destination>] [-l|--level <level>]
Example:
   ./walk_docbase.py --docbase dmtest
will list all the filestores in docbase dmtest and exit;
   ./walk_docbase.py --docbase dmtest --filestore all --dest dmadmin@remote-host:/documentum/data/cloned_dmtest
will transfer all the filestores' content to the remote destination in /documentum/data, e.g.:
   dm_location.path = /data/dctm/dmtest/filestore_01 --> /documentum/data/cloned_dmtest/filestore_01
if dest does not contain a file path, the same path as the source is used, e.g.:
   ./walk_docbase.py --docbase dmtest --filestore filestore_01 --dest dmadmin@remote-host
will transfer the filestore_01 filestore's content to the remote destination into the same directory, e.g.:
   dm_location.path = /data/dctm/dmtest/filestore_01 --> /documentum/dctm/dmtest/filestore_01
In any case, the destination root directory, if any is given, must exist as rsync does not create it (although it creates the implied directories);
Generated statements can be dry-tested by adding the option --dry-run, e.g.
rsync -avzR --dry-run /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0c dmadmin@dmtest:/home/dctm/dmtest
Commented out SQL statements to update the dm_location.file_system_path for each dm_filestore are output if the destination path differs from the source's;
level is the starting sub-directory level that will be copied by rsync;
Allowed values for level are 0 (the default, the terminal directories level), -1 and -2;
level -1 means the sub-directory level above the terminal directories, and so on;
Practically, use 0 for better granularity and parallelization;
""")

def print_error(error):
   print(error)

def collect_results(list_paths):
   global stats
   for s in list_paths:
      stats.paths = stats.paths.union(s)

def explore_fs_slice(stmt):
   unique_paths = set()
   lev = level
   try:
      for item in DctmAPI.select_cor(session, stmt):
         fullpath = DctmAPI.select2dict(session, f"execute get_path for '{item['r_object_id']}'")
         last_sep = fullpath[0]['result'].rfind(os.sep)
         fullpath = fullpath[0]['result'][last_sep - 8 : last_sep]
         for lev in range(level, 0):
            last_sep = fullpath.rfind(os.sep)
            fullpath = fullpath[ : last_sep]
         unique_paths.add(fullpath)
   except Exception as e:
      print(e)
      traceback.print_stack()
   DctmAPI.show(f"for stmt {stmt}, unique_paths={unique_paths}")
   return unique_paths

# --------------------------------------------------------
# main;
if __name__ == "__main__":
   DctmAPI.logLevel = 0
 
   # parse the command-line parameters;
   # old-style for more flexibility is not needed here;
   repository = None
   s_filestores = None
   dest = None
   user = ""
   dest_host = ""
   dest_path = ""
   level = 0
   try:
       (opts, args) = getopt.getopt(sys.argv[1:], "hr:f:d:l:", ["help", "repository=", "docbase=", "filestores=", "dest=", "level="])
   except getopt.GetoptError:
      print("Illegal option")
      print("./graph-stats.py -h|--help | [-r|--repository ] [-f|--filestores [{,}]|all] [-d|--dest ][-l|--level ]")
      sys.exit(1)
   for opt, arg in opts:
      if opt in ("-h", "--help"):
         Usage()
         sys.exit()
      elif opt in ("-r", "--repository"):
         repository = arg
      elif opt in ("-f", "--filestores"):
         s_filestores = arg
      elif opt in ("-d", "--dest"):
         dest = arg
         p_at = arg.rfind("@")
         p_colon = arg.rfind(":")
         if -1 != p_at:
            user = arg[ : p_at]
            if -1 != p_colon:
               dest_host = arg[p_at + 1 : p_colon]
               dest_path = arg[p_colon + 1 : ]
            else:
               dest_path = arg[p_at + 1 : ]
         elif -1 != p_colon:
            dest_host = arg[ : p_colon]
            dest_path = arg[p_colon + 1 : ]
         else:
            dest_path = arg
      elif opt in ("-l", "--level"):
         try:
            level = int(arg)
            if -2 > level or level > 0:
               print("raising")
               raise Exception()
         except:
            print("level must be a non positive integer inside the interval (-2,  0)")
            sys.exit()
   if None == repository:
      print("the repository is mandatory")
      Usage()
      sys.exit()
   if None == dest or "" == dest:
      if None != s_filestores:
         print("the destination is mandatory")
         Usage()
         sys.exit()
   if None == s_filestores or 'all' == s_filestores:
      # all filestores requested;
      s_filestores = "all"
      filestores = None
   else:
      filestores = s_filestores.split(",")
 
   # global parameters;
   # we use a typical idiom to create a cheap namespace;
   class gp: pass
   gp.maxWorkers = 100
   class stats: pass

   # connect to the repository;
   DctmAPI.show(f"Will connect to docbase(s): {repository} and transfer filestores [{s_filestores}] to destination {dest if dest else 'None'}")

   status = DctmAPI.dmInit()
   session = DctmAPI.connect(docbase = repository, user_name = "dmadmin", password = "dmadmin")
   if session is None:
      print("no session opened, exiting ...")
      exit(1)

   # we need the docbase id in hex format;
   gp.docbase_id = "{:08x}".format(int(DctmAPI.dmAPIGet("get,c,docbaseconfig,r_docbase_id")))

   # get the requested filestores' dm_locations;
   stmt = 'select fs.r_object_id, fs.name, fs.root, l.r_object_id as "loc_id", l.file_system_path from dm_filestore fs, dm_location l where {:s}fs.root = l.object_name'.format(f"fs.name in ({str(filestores)[1:-1]}) and " if filestores else "")
   fs_dict = DctmAPI.select2dict(session, stmt)
   DctmAPI.show(fs_dict)
   if None == dest:
      print(f"filestores in repository {repository}")
      for s in fs_dict:
         print(s['name'])
      sys.exit()

   # filestores are processed sequentially but inside each filestore, the contents are queried concurrently;
   for storage_id in fs_dict:
      print(f"# rsync statements for filestore {storage_id['name']};")
      stats.paths = set()
      stmt = f"select r_object_id from dmr_content where storage_id = '{storage_id['r_object_id']}' and r_object_id like "
      a_stmts = []
      for slice in range(16):
         a_stmts.append(stmt + "'%{:x}'".format(slice))   # 16 slice patterns, '%0' .. '%f'
      gp.path_pool = multiprocessing.pool.Pool(processes = 16)
      gp.path_pool.map_async(explore_fs_slice, a_stmts, 1, callback = collect_results, error_callback = print_error)   # chunksize 1 so the 16 slices are spread across the pool's processes
      gp.path_pool.close()
      gp.path_pool.join()
      SQL_stmts = set()
      for p in stats.paths:
         last_sep = storage_id['file_system_path'].rfind(os.sep)
         if "" == dest_path:
            dp = storage_id['file_system_path'][ : last_sep] 
         else:
            dp = dest_path
         # note the dot in the source path: relative implied directories will be created from that position; 
         print(f"rsync -avzR {storage_id['file_system_path'][ : last_sep]}/.{storage_id['file_system_path'][last_sep : ]}/{str(gp.docbase_id)}/{p} {(user + '@') if user else ''}{dest_host}{':' if dest_host else ''}{dp}")
         if storage_id['file_system_path'][ : last_sep] != dest_path:
            # dm_location.file_system_path has changed, give the SQL statements to update them in clone's schema;
            SQL_stmts.add(f"UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '{storage_id['file_system_path'][ : last_sep]}', '{dest_path}') WHERE r_object_id = '{storage_id['loc_id']}';")
      # commented out SQL statements to run before starting the repository clone;
      for stmt in SQL_stmts:
         print(f"# {stmt}")

   status = DctmAPI.disconnect(session)
   if not status:
      print("error while  disconnecting")

On line 10, the module DctmAPI is imported; such module was presented in a previous article (see here) but I include an updated version at the end of the present article.
Note the call to DctmAPI.select_cor() on line 51; this is a special version of DctmAPI.select2dict() where _cor stands for coroutine; actually, in python, it is called a generator but it looks very much like a coroutine from other programming languages. Its main interest is to separate navigating through the result set from consuming the returned data, for more clarity; also, since the data are consumed one row at a time, there is no need to read them all into memory at once and pass them to the caller, which is especially efficient here where we can potentially have millions of documents. DctmAPI.select2dict() is still available and used when the expected result set is very limited, as for the list of dm_locations on line 154. By the way, DctmAPI.select2dict() invokes DctmAPI.select_cor() from within a list constructor, so they share that part of the code.
On line 171, function map_async is used to start 16 concurrent calls to explore_fs_slice on line 47 per filestore (the r_object_id % 16 expressed in DQL as r_object_id like ‘%0’ .. ‘%f’), each in its own process. That function repeatedly gets an object_id from the coroutine above and calls the administrative method get_path on it (we could query the dmr_content.data_ticket and compute the file path ourselves but would it be any faster ?); the function returns a set of unique paths for its slice of ids. The pool is then closed and joined to wait until all the processes terminate. Their result is collected by the callback collect_results starting on line 42; its parameter, list_paths, is a list of sets received from map_async (which received the sets from the terminating concurrent invocations of explore_fs_slice and put them in a list) that are further made unique by union-ing them into a global set. Starting on line 175, this set is iterated to generate the rsync commands.
Example of execution:

./walk-docbase.py -r dmtest -f all -d dmadmin@dmtest:/home/dctm/dmtest | tee clone-content
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0c dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/02 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/06 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/04 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/09 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/03 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/07 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/05 dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0a dmadmin@dmtest:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/08 dmadmin@dmtest:/home/dctm/dmtest
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', '/home/dctm/dmtest') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;

This execution generated the rsync commands to copy all the dmtest repository’s filestores to the remote host dmtest’s new location /home/dctm/dmtest. As the filestores’ dm_location has changed, an SQL statement (to be taken as an example because it is for an Oracle RDBMS; the syntax may differ in another RDBMS) has also been generated to accommodate the new path. We do this in SQL because the docbase clone will still be down at this time and the change must be done at the database level.
The other default filestores in the test docbase are empty and so no rsync commands are necessary for them; normally, the placeholder docbase has already initialized their sub-trees.
Those rsync commands could be executed in parallel, say, 10 at a time, by launching them in the background with “wait” commands inserted in between, like this:

./walk-docbase.py -r dmtest -f all -d dmadmin@dmtest:/home/dctm/dmtest | gawk -v nb_rsync=10 'BEGIN {
print "\#!/bin/bash"
nb_dirs = 0
}
{
if (!$0 || match($0, /^#/)) {print; next}
print $0 " &"
if (0 == ++nb_dirs % nb_rsync)
print "wait"
}
END {
if (0 != nb_dirs % nb_rsync)
print "wait"
}' | tee migr.sh
#!/bin/bash
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/08 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/02 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/04 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/03 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/06 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0a dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/07 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0c dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/05 dmadmin@dmtest:/home/dctm/dmtest &
wait
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/09 dmadmin@dmtest:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b dmadmin@dmtest:/home/dctm/dmtest &
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', '/home/dctm/dmtest') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;
wait

Even though there may be a count difference of up to 255 files between some rsync commands, they should complete roughly at the same time so that nb_rsync commands should be running at any time. If not, i.e. if the transfers frequently wait for a few long-running rsync commands to complete (it could happen with huge files), it may be worth using a task manager that makes sure the requested parallelism degree is respected at any one time throughout the whole execution.
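Such a task manager does not need to be sophisticated; here is a minimal python sketch, assuming the generated rsync commands have been saved one per line into a file (the file name rsync_commands.txt and the pool size are illustrative):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run(cmd):
   # each worker blocks on its own rsync, so at most max_workers transfers run at any one time;
   return cmd, subprocess.call(cmd, shell = True)

with open("rsync_commands.txt") as f:
   cmds = [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]

with ThreadPoolExecutor(max_workers = 10) as pool:
   for cmd, rc in pool.map(run, cmds):
      if 0 != rc:
         print(f"command failed with rc {rc}: {cmd}")
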
Let’s now make the generated script executable and launch it:

chmod +x migr.sh
time ./migr.sh

The parameter level lets one choose the level of the sub-directories that rsync will copy, 0 (the default) for terminal sub-directories, -1 for the level right above them and -2 for the level above those. As discussed, the lower the level, the fewer rsync commands are necessary, e.g. up to 128 * 64K for level = 0, up to 128 * 256 for level = -1 and up to 128 for level = -2.
Example of execution:

./walk-docbase.py -r dmtest -d dd --level -1
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00 dmadmin@dmtest:/home/dctm/dmtest &
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', 'dd') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;

And also:

./walk-docbase.py -r dmtest -d dd --level -2
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80 dmadmin@dmtest:/home/dctm/dmtest &
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', 'dd') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;

So, this first solution looks quite simple even though it initially puts a little, yet tolerable, stress on the docbase. The python script connects to the repository and generates the required rsync commands (with a user-selectable compactness level) and a gawk filter prepares an executable script with those statements launched in parallel, N (user-selectable) at a time.
Performance-wise, it’s not so good because all the contents must be queried for their full path, and that’s a lot of queries for a large repository.

All this being said, let’s see now if a direct and faster, out-of-the-repository filesystem copy procedure can be devised. Please follow the rest of this article in part II. The next paragraph just lists the latest version of the module DctmAPI.py.

DctmAPI.py revisited
"""
This module is a python - Documentum binding based on ctypes;
requires libdmcl40.so/libdmcl.so to be reachable through LD_LIBRARY_PATH;
C. Cervini - dbi-services.com - december 2018

The binding works as-is for both python2 and python3; no recompilation required; that's the good thing with ctypes compared to e.g. distutils/SWIG;
Under a 32-bit O/S, it must use the libdmcl40.so, whereas under a 64-bit Linux it must use the java backed one, libdmcl.so;

For compatibility with python3 (where strings are now unicode and no longer arrays of bytes), ctypes string parameters are always converted to bytes, either by prefixing them
with a b if literal or by invoking their encode('ascii', 'ignore') method; to get back to text from bytes, b.decode() is used; this works in python2 as well as in python3 so the source is compatible with both versions of the language;
"""

import os
import ctypes
import sys, traceback

# use foreign C library;
# use this library in eContent server = v6.x, 64-bit Linux;
dmlib = 'libdmcl.so'

dm = 0

class getOutOfHere(Exception):
   pass

def show(mesg, beg_sep = False, end_sep = False):
   "displays the message msg if allowed"
   if logLevel > 0:
      print(("\n" if beg_sep else "") + repr(mesg), ("\n" if end_sep else ""))

def dmInit():
   """
   initializes the Documentum part;
   returns True if successfull, False otherwise;
   dmAPI* are global aliases on their respective dm.dmAPI* for some syntaxic sugar;
   since they already have an implicit namespace through their dm prefix, dm.dmAPI* would be redundant so let's get rid of it;
   returns True if no error, False otherwise;
   """

   show("in dmInit()")
   global dm

   try:
      dm = ctypes.cdll.LoadLibrary(dmlib);  dm.restype = ctypes.c_char_p
      show("dm=" + str(dm) + " after loading library " + dmlib)
      dm.dmAPIInit.restype    = ctypes.c_int;
      dm.dmAPIDeInit.restype  = ctypes.c_int;
      dm.dmAPIGet.restype     = ctypes.c_char_p;      dm.dmAPIGet.argtypes  = [ctypes.c_char_p]
      dm.dmAPISet.restype     = ctypes.c_int;         dm.dmAPISet.argtypes  = [ctypes.c_char_p, ctypes.c_char_p]
      dm.dmAPIExec.restype    = ctypes.c_int;         dm.dmAPIExec.argtypes = [ctypes.c_char_p]
      status  = dm.dmAPIInit()
   except Exception as e:
      print("exception in dminit(): ")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      show("exiting dmInit()")
      return True if 0 != status else False
   
def dmAPIDeInit():
   """
   releases the memory structures in documentum's library;
   returns True if no error, False otherwise;
   """
   status = dm.dmAPIDeInit()
   return True if 0 != status else False
   
def dmAPIGet(s):
   """
   passes the string s to dmAPIGet() method;
   returns a non-empty string if OK, None otherwise;
   """
   value = dm.dmAPIGet(s.encode('ascii', 'ignore'))
   return value.decode() if value is not None else None

def dmAPISet(s, value):
   """
   passes the string s to dmAPISet() method;
   returns TRUE if OK, False otherwise;
   """
   status = dm.dmAPISet(s.encode('ascii', 'ignore'), value.encode('ascii', 'ignore'))
   return True if 0 != status else False

def dmAPIExec(stmt):
   """
   passes the string stmt to dmAPIExec() method;
   returns TRUE if OK, False otherwise;
   """
   status = dm.dmAPIExec(stmt.encode('ascii', 'ignore'))
   return True if 0 != status else False

def connect(docbase, user_name, password):
   """
   connects to given docbase as user_name/password;
   returns a session id if OK, None otherwise
   """
   show("in connect(), docbase = " + docbase + ", user_name = " + user_name + ", password = " + password) 
   try:
      session = dmAPIGet("connect," + docbase + "," + user_name + "," + password)
      if session is None or not session:
         raise(getOutOfHere)
      else:
         show("successful session " + session)
         show(dmAPIGet("getmessage," + session).rstrip())
   except getOutOfHere:
      print("unsuccessful connection to docbase " + docbase + " as user " + user_name)
      session = None
   except Exception as e:
      print("Exception in connect():")
      print(e)
      traceback.print_stack()
      session = None
   finally:
      show("exiting connect()")
      return session

def execute(session, dql_stmt):
   """
   execute non-SELECT DQL statements;
   returns TRUE if OK, False otherwise;
   """
   show("in execute(), dql_stmt=" + dql_stmt)
   try:
      query_id = dmAPIGet("query," + session + "," + dql_stmt)
      if query_id is None:
         raise(getOutOfHere)
      err_flag = dmAPIExec("close," + session + "," + query_id)
      if not err_flag:
         raise(getOutOfHere)
      status = True
   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
      status = False
   except Exception as e:
      print("Exception in execute():")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      show(dmAPIGet("getmessage," + session).rstrip())
      show("exiting execute()")
      return status

def select2dict(session, dql_stmt, attr_name = None):
   """
   execute the DQL SELECT statement passed in dql_stmt and returns an array of dictionaries (one per row) into result;
   attr_name is the list of extracted attributes (the ones in SELECT ..., as interpreted by the server); if not None, attribute names are appended to it, otherwise nothing is returned;
   """
   show("in select2dict(), dql_stmt=" + dql_stmt)
   return list(select_cor(session, dql_stmt, attr_name))

def select_cor(session, dql_stmt, attr_name = None):
   """
   execute the DQL SELECT statement passed in dql_stmt and return one row at a time;
   coroutine version;
   if the optional attr_name list is not None, the attribute names returned by the result set are appended to it, otherwise no names are returned;
   return True if OK, False otherwise;
   """
   show("in select_cor(), dql_stmt=" + dql_stmt)

   status = False
   try:
      query_id = dmAPIGet("query," + session + "," + dql_stmt)
      if query_id is None:
         raise(getOutOfHere)

      # iterate through the result set;
      row_counter = 0
      if None == attr_name:
         attr_name = []
      width = {}
      while dmAPIExec("next," + session + "," + query_id):
         result = {}
         nb_attrs = dmAPIGet("count," + session + "," + query_id)
         if nb_attrs is None:
            show("Error retrieving the count of returned attributes: " + dmAPIGet("getmessage," + session))
            raise(getOutOfHere)
         nb_attrs = int(nb_attrs) 
         for i in range(nb_attrs):
            if 0 == row_counter:
               # get the attributes' names only once for the whole query;
               value = dmAPIGet("get," + session + "," + query_id + ",_names[" + str(i) + "]")
               if value is None:
                  show("error while getting the attribute name at position " + str(i) + ": " + dmAPIGet("getmessage," + session))
                  raise(getOutOfHere)
               attr_name.append(value)
               if value in width:
                  width[value] = max(width[attr_name[i]], len(value))
               else:
                  width[value] = len(value)

            is_repeating = dmAPIGet("repeating," + session + "," + query_id + "," + attr_name[i])
            if is_repeating is None:
               show("error while getting the arity of attribute " + attr_name[i] + ": " + dmAPIGet("getmessage," + session))
               raise(getOutOfHere)
            is_repeating = int(is_repeating)

            if 1 == is_repeating:
               # multi-valued attributes;
               result[attr_name[i]] = []
               count = dmAPIGet("values," + session + "," + query_id + "," + attr_name[i])
               if count is None:
                  show("error while getting the arity of attribute " + attr_name[i] + ": " + dmAPIGet("getmessage," + session))
                  raise(getOutOfHere)
               count = int(count)

               for j in range(count):
                  value = dmAPIGet("get," + session + "," + query_id + "," + attr_name[i] + "[" + str(j) + "]")
                  if value is None:
                     value = "null"
                  #result[row_counter] [attr_name[i]].append(value)
                  result[attr_name[i]].append(value)
            else:
               # mono-valued attributes;
               value = dmAPIGet("get," + session + "," + query_id + "," + attr_name[i])
               if value is None:
                  value = "null"
               width[attr_name[i]] = len(attr_name[i])
               result[attr_name[i]] = value
         if 0 == row_counter:
            show(attr_name)
         yield result
         row_counter += 1
      err_flag = dmAPIExec("close," + session + "," + query_id)
      if not err_flag:
         show("Error closing the query collection: " + dmAPIGet("getmessage," + session))
         raise(getOutOfHere)

      status = True

   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
      status = False
   except Exception as e:
      print("Exception in select2dict():")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      return status

def select(session, dql_stmt, attribute_names):
   """
   execute the DQL SELECT statement passed in dql_stmt and outputs the result to stdout;
   attributes_names is a list of attributes to extract from the result set;
   return True if OK, False otherwise;
   """
   show("in select(), dql_stmt=" + dql_stmt)
   try:
      query_id = dmAPIGet("query," + session + "," + dql_stmt)
      if query_id is None:
         raise(getOutOfHere)

      s = ""
      for attr in attribute_names:
         s += "[" + attr + "]\t"
      print(s)
      resp_cntr = 0
      while dmAPIExec("next," + session + "," + query_id):
         s = ""
         for attr in attribute_names:
            value = dmAPIGet("get," + session + "," + query_id + "," + attr)
            if "r_object_id" == attr and value is None:
               raise(getOutOfHere)
            s += "[" + (value if value else "None") + "]\t"
            show(str(resp_cntr) + ": " + s)
         resp_cntr += 1
      show(str(resp_cntr) + " rows iterated")

      err_flag = dmAPIExec("close," + session + "," + query_id)
      if not err_flag:
         raise(getOutOfHere)

      status = True
   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
      status = False
   except Exception as e:
      print("Exception in select():")
      print(e)
      traceback.print_stack()
      print(resp_cntr); print(attr); print(s); print("[" + value + "]")
      status = False
   finally:
      show("exiting select()")
      return status

def walk_group(session, root_group, level, result):
   """
   recursively walk a group hierarchy with root_group as top parent;
   """

   try:
      root_group_id = dmAPIGet("retrieve," + session + ",dm_group where group_name = '" + root_group + "'")
      if 0 == level:
         if root_group_id is None:
            show("Cannot retrieve group [" + root_group + "]:" + dmAPIGet("getmessage," + session))
            raise(getOutOfHere)
      result[root_group] = {}

      count = dmAPIGet("values," + session + "," + root_group_id + ",groups_names")
      if "" == count:
         show("error while getting the arity of attribute groups_names: " + dmAPIGet("getmessage," + session))
         raise(getOutOfHere)
      count = int(count)

      for j in range(count):
         value = dmAPIGet("get," + session + "," + root_group_id + ",groups_names[" + str(j) + "]")
         if value is not None:
            walk_group(session, value, level + 1, result[root_group])

   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
   except Exception as e:
      print("Exception in walk_group():")
      print(e)
      traceback.print_stack()

def disconnect(session):
   """
   closes the given session;
   returns True if no error, False otherwise;
   """
   show("in disconnect()")
   try:
      status = dmAPIExec("disconnect," + session)
   except Exception as e:
      print("Exception in disconnect():")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      show("exiting disconnect()")
      return status

The main changes added a generator-based select_cor() function yielding a dictionary for each row of the result set and modified select2dict() to use it.

The article Two techniques for cloning a repository filestore, part I first appeared on the dbi services Blog.

Review user and application access with Oracle IDCS Audit Reports

Oracle Identity Cloud Service provides identity management, single-sign-on (SSO) and identity governance for applications on-premise, in the cloud and mobile applications. Audit events enable...

We share our skills to maximize your revenue!
Categories: DBA Blogs

Oracle Expands Cloud Business with Next-Gen Data Center in Canada

Oracle Press Releases - Thu, 2019-01-17 10:30
Press Release
Oracle Expands Cloud Business with Next-Gen Data Center in Canada New Toronto data center supports growing customer and partner demand for Oracle Cloud Infrastructure

Toronto—Jan 17, 2019

Oracle today announced the opening of a Toronto data center to support in region customer demand for Oracle’s public cloud, Oracle Cloud Infrastructure. Oracle’s next-generation cloud infrastructure offers the most flexibility in the public cloud, allowing companies to run traditional and cloud-native workloads on the same platform. With Oracle’s modern cloud regions, only Oracle can deliver the industry’s broadest, deepest, and fastest growing suite of cloud applications, Oracle Autonomous Database, and new services in security, Blockchain and Artificial Intelligence, all running on its enterprise-grade cloud infrastructure.

“Enterprises in the region still have limited ability to run mission-critical applications in the cloud and are struggling to attain the level of performance they have on-premises without a major overhaul,” said Don Johnson, executive vice president, product development, Oracle Cloud Infrastructure. “With this new location Oracle is delivering on its promise to deliver even more customers with consistent high performance, low predictable pricing and the flexibility our cloud brings to the table.”

While cloud adoption has increased, many organizations are still hesitant to make the transition to the cloud due to security concerns and a desire to protect existing investments. First generation public cloud offerings were not architected to accommodate traditional application architectures. Oracle’s next-generation cloud infrastructure is built specifically to help organizations of any size run the most demanding workloads securely while delivering unmatched security, performance, and cost savings.

“Oracle’s innovative cloud technologies will help our commercial and public sector customers in the region transform their business and improve citizen services. The new Toronto data center is supporting the fast-growing customer demand for an enterprise-grade cloud in the region. Canada is a very strategic marketplace for Oracle and we are excited to continue to invest in the country,” said Rich Geraffo, Executive Vice President, North American Technology Division, Oracle.

By the end of this year, the company plans to open additional regions in Australia, Europe, Japan, South Korea, India, Brazil, the Middle East, and the United States, including Virginia, Arizona, and Illinois to support public sector and Department of Defense customers. This expansion complements an existing Edge network consisting of more than 30 global locations and 300 plus sensors, providing Oracle customers with a comprehensive Internet performance data set, and deep edge services capabilities.

Supporting Quotes

“Our work with Oracle and Oracle Cloud Infrastructure FastConnect has been extremely successful in delivering customers with improved performance, enhanced control, flexibility, security and scalability to critical business data and functions. This new Toronto region will build on this success and support the growing customer demand for high performance, low cost cloud solutions.” – Bill Fathers, Chairman & CEO, Cologix

“Domtar makes products that people around the world rely on every day which is why it’s important for us to have technology we can rely on. We have selected Oracle Cloud Infrastructure to support our enterprise, mission-critical workloads so we can continue to our commitment of innovation for a sustainable and better future. The new region in the country we are based is a great development for us.” – Domtar

“At Frozen Mountain, we develop products that are relevant and effective in meeting business needs in a world of evolving technology. Frozen Mountain has been working with Oracle to help deliver clients and partners with the cloud infrastructure needed to meet the modern requirements of this connected world. We look forward to extending our footprint with Oracle Cloud Infrastructure to take advantage of the new Toronto region.” – Greg Batenburg, V.P. Business Development, Frozen Mountain

“LifeLabs continually harness its medical expertise to build the best test offering while investing in technology to transform the delivery of health care. We do provide value to the healthcare system while remaining responsive to changing environments and needs. As a trusted partner supporting patients, healthcare providers, corporate customers and entire provincial healthcare systems, we are working with Oracle to utilize the power of the cloud to innovate and ensure we create a culture of continuous improvement. We look forward to the continued collaboration with Oracle.” – Husam Shublaq, Director for Application Services, LifeLabs

“At McMaster, we strive for educational and operational excellence. We are excited about the opening of Oracle’s Toronto region. As we deliver on the McMaster IT Strategy and focus on providing the best possible solutions for our campus community, including students, researchers and faculty, we appreciate having additional delivery models to consider, including Oracle's cloud technology.” – Kevin de Kock, Director of Enterprise Solutions and Applications, McMaster University

“Seneca is focused on providing our students with great education that they will need to navigate and thrive in this exciting and evolving world. Utilizing all the cloud has to offer will help us continue to deliver our students and faculty with the technology they need to ensure the best education. We are thrilled about the new Toronto region Oracle is opening, and look forward to continuing to leverage it.” – Radha Krishnan, Chief Information Officer, Seneca College

“There is a huge demand from the Canada market for a public cloud offering customers can trust and one that meets the data sovereignty requirements in the region. We’re excited to see Oracle Cloud Infrastructure launching in Canada as a local region is paramount to Canadian organizations. We’re looking forward to working with Oracle to leverage this new region and help ensure we are delivering customers with a cloud that can handle the most complicated workloads.” – Ravi Reddy, CEO & President, SuneraTech

Contact Info
Danielle Tarp
Oracle
+1.650.506.2905
danielle.tarp@oracle.com
Nicole Maloney
Oracle
+1.650.506.0806
nicole.maloney@oracle.com
About Oracle Cloud Infrastructure

Oracle Cloud Infrastructure is an enterprise Infrastructure as a Service (IaaS) platform. Companies of all sizes rely on Oracle Cloud to run enterprise and cloud native applications with mission-critical performance and core-to-edge security. By running both traditional and new workloads on a comprehensive cloud that includes compute, storage, networking, database, and containers, Oracle Cloud Infrastructure can dramatically increase operational efficiency and lower total cost of ownership. For more information, visit https://cloud.oracle.com/iaas

About Oracle

The Oracle Cloud offers complete SaaS application suites for ERP, HCM and CX, plus best-in-class database Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) from data centers throughout the Americas, Europe and Asia. For more information about Oracle (NYSE:ORCL), please visit us at www.oracle.com.

Trademarks

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Talk to a Press Contact

Danielle Tarp

  • +1.650.506.2905

Nicole Maloney

  • +1.650.506.0806

Deploying SQL Server 2019 AGs on K8s with helm charts

Yann Neuhaus - Thu, 2019-01-17 07:00

This write-up follows my first article about helm charts with SQL Server. This time, I would like to cover the availability groups topic and how to deploy them with helm charts.


In fact, looking into this feature for AGs was motivated by its usage in our Azure DevOps CI pipeline, in order to deploy a configurable AG on an AKS cluster with SQL Server 2019.

[Screenshot: Azure DevOps release pipeline with Helm]

If you look carefully at the release pipeline, Windocks is another product we are using for our integration testing with SQL Server containers, and I will probably explain more on this topic in a future blog post. But this time I would like to share some experiences with the construction of the AG helm chart.

First of all, let me point out that I used the content provided by Microsoft on GitHub to deploy availability groups on K8s. This is a new functionality of SQL Server 2019 and we are actually running the CTP 2.1 version. Chances are things will change over time and I bet Microsoft will release its own helm chart in the future. Anyway, for me it was an interesting opportunity to take a deep dive into the helm charts feature.

The first step I went through was the parametrization (one big benefit of Helm) of the existing template with input values, including the AG's name, the container image repository and tag used for deployment, and different service settings like service type, service port and target service port.

Here is one of my values.yaml files:

# General parameters
agname: ag1
acceptEula: true
# Container parameters
agentsContainerImage:
  repository: mcr.microsoft.com/mssql/ha
  tag: 2019-CTP2.1-ubuntu
  pullPolicy: IfNotPresent
sqlServerContainer:
  repository: mcr.microsoft.com/mssql/server
  tag: 2019-CTP2.1-ubuntu
  pullPolicy: IfNotPresent
# Service parameters
sqlservice:
  type: LoadBalancer
  port: 1433
agservice:
  type: LoadBalancer
  port: 1433

 

As a reminder, services on K8s are a way to expose pods to the outside world. I also introduced some additional labels for the purpose of querying the system. These are basically the same labels as used in the stable templates on GitHub.

labels:
    …
    app: {{ template "dbi_mssql_ag.name" . }}
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}

 

Here is a sample of my ag-sqlserver-deployment.yaml file with the parametrization in place:

apiVersion: mssql.microsoft.com/v1
kind: SqlServer
metadata:
  labels:
    name: mssql1
    type: sqlservr
    app: {{ template "dbi_mssql_ag.name" . }}
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
  name: mssql1
  namespace: {{ .Release.Namespace }}
spec:
  acceptEula: true
  agentsContainerImage: {{ .Values.agentsContainerImage.repository }}:{{ .Values.agentsContainerImage.tag }}
  availabilityGroups: [{{ .Values.agname }}]
  instanceRootVolumeClaimTemplate:
    accessModes: [ReadWriteOnce]
    resources:
      requests: {storage: 5Gi}
    storageClass: default
  saPassword:
    secretKeyRef: {key: sapassword, name: sql-secrets}
  masterKeyPassword:
    secretKeyRef: {key: masterkeypassword, name: sql-secrets} 
  sqlServerContainer: {image: '{{ .Values.sqlServerContainer.repository }}:{{ .Values.sqlServerContainer.tag }}'}

 

In addition, for the sake of clarity, I also took the opportunity to break the YAML files provided by Microsoft into different pieces, including AG operator, AG RBAC configuration (security), AG instances and resources, and AG services files. But you may wonder (like me) how to control the ordering of the objects' creation? Well, it is worth noting that Helm collects all of the resources in a given Chart and its dependencies, groups them by resource type, and then installs them in a pre-defined order.

Let's also mention that I removed the namespace's creation from the existing YAML file because helm charts are currently not able to create a namespace if it doesn't exist. From a security perspective, my concern was to keep the release deployment under control through helm charts, so I preferred to specify the namespace directly in the --namespace parameter, as well as to create the sql-secrets object, which contains both the sa and master key passwords, manually according to the Microsoft documentation. In this way, you may segregate the permissions tiller has on a specific namespace and, in my case, tiller has access to the ci namespace only.
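For illustration only, the same sql-secrets object could also be created programmatically with the kubernetes python client instead of kubectl (a sketch; the password values below are placeholders):

from kubernetes import client, config

# assumes a working kubeconfig pointing to the AKS cluster;
config.load_kube_config()
v1 = client.CoreV1Api()
secret = client.V1Secret(
   metadata = client.V1ObjectMeta(name = "sql-secrets", namespace = "ci"),
   string_data = {"sapassword": "xxxx", "masterkeypassword": "xxxx"},
   type = "Opaque"
)
v1.create_namespaced_secret(namespace = "ci", body = secret)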

Here comes likely the most interesting part. During the first deployment attempt, what I had to face was the dependency that exists between the AG operator, SQL Server and the AG resources, as stated here:

The operator implements and registers the custom resource definition for SQL Server and the Availability Group resources.

The custom resource definition (CRD) is one of the first components created with the deployment of the AG operator. You may retrieve an API service v1.mssql.microsoft.com as shown below:

$ kubectl describe apiservice v1.mssql.microsoft.com
Name:         v1.mssql.microsoft.com
Namespace:
Labels:       kube-aggregator.kubernetes.io/automanaged=true
Annotations:  <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2019-01-17T01:45:19Z
  Resource Version:    283588
  Self Link:           /apis/apiregistration.k8s.io/v1/apiservices/v1.mssql.microsoft.com
  UID:                 8b90159d-19f9-11e9-96ba-ee2da997daf5
Spec:
  Group:                   mssql.microsoft.com
  Group Priority Minimum:  1000
  Service:                 <nil>
  Version:                 v1
  Version Priority:        100
Status:
  Conditions:
    Last Transition Time:  2019-01-17T01:45:19Z
    Message:               Local APIServices are always available
    Reason:                Local
    Status:                True
    Type:                  Available
Events:                    <none>

 

Then the API is referenced in the YAML file that contains the definition of SQL Server resource objects through the following elements:

apiVersion: mssql.microsoft.com/v1
kind: SqlServer

 

As you probably guessed, if this API is missing on your K8s cluster at the moment of installing the AG resources you’ll probably face the following error message:

Error: [unable to recognize "": no matches for kind "SqlServer" in version "mssql.microsoft.com/v1", unable to recognize "": no matches for kind "SqlServer" in version "mssql.microsoft.com/v1", unable to recognize "": no matches for kind "SqlServer" in version "mssql.microsoft.com/v1"]

At this stage, referring to the Helm documentation, I decided to split my initial release deployment into 2 separate helm charts. Of the 2 methods suggested in the documentation, I much prefer this one because updating / removing releases is a little bit easier, at the cost of introducing an additional chart into the game. With the CRD hook method, the CRD is not attached to a specific chart deployment, so if we need to change something in the CRD, it doesn't get updated in the cluster unless we tear down the chart and install it again. This also means that we can't add a CRD to a chart that has already been deployed. Finally, I took a look at the chart dependencies feature but it doesn't fix my issue at all because chart validation seems to come before the completion of the custom API. This is at least what I noticed with my current helm version (v2.12.1). Probably one area for Microsoft to investigate …

So let's continue … Here is the structure of my two helm charts (respectively for my AG resources and my AG operator):

$ tree /f

───dbi_mssql_ag
│   │   .helmignore
│   │   Chart.yaml
│   │   values.yaml
│   │
│   ├───charts
│   └───templates
│       │   ag-services.yaml
│       │   ag-sqlserver-deployment.yaml
│       │   NOTES.txt
│       │   _helpers.tpl
│       │
│       └───tests
└───dbi_mssql_operator
    │   .helmignore
    │   Chart.yaml
    │   values.yaml
    │
    ├───charts
    └───templates
        │   ag-operator-deployment.yaml
        │   ag-security.yaml
        │   NOTES.txt
        │   _helpers.tpl
        │
        └───tests

 

The deployment consists of installing the two charts in the correct order:

$ helm install --name ag-2019-o --namespace ci .\dbi_mssql_operator\
…
$ helm install --name ag-2019 --namespace ci .\dbi_mssql_ag\
…
$ helm ls
NAME            REVISION        UPDATED                         STATUS          CHART           APP VERSION     NAMESPACE
ag-2019         1               Wed Jan 16 23:48:33 2019        DEPLOYED        dbi-mssql-ag-1  2019.0.0        ci
ag-2019-o       1               Wed Jan 16 23:28:17 2019        DEPLOYED        dbi-mssql-ag-1  2019.0.0        ci
…
$ kubectl get all -n ci
NAME                                 READY     STATUS      RESTARTS   AGE
pod/mssql-initialize-mssql1-hb9xs    0/1       Completed   0          2m
pod/mssql-initialize-mssql2-47n99    0/1       Completed   0          2m
pod/mssql-initialize-mssql3-67lzn    0/1       Completed   0          2m
pod/mssql-operator-7bc948fdc-45qw5   1/1       Running     0          22m
pod/mssql1-0                         2/2       Running     0          2m
pod/mssql2-0                         2/2       Running     0          2m
pod/mssql3-0                         2/2       Running     0          2m

NAME                  TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)             AGE
service/ag1           ClusterIP      None           <none>          1433/TCP,5022/TCP   2m
service/ag1-primary   LoadBalancer   10.0.62.75     xx.xx.xxx.xx   1433:32377/TCP      2m
service/mssql1        LoadBalancer   10.0.45.155    xx.xx.xxx.xxx   1433:31756/TCP      2m
service/mssql2        LoadBalancer   10.0.104.145   xx.xx.xxx.xxx       1433:31285/TCP      2m
service/mssql3        LoadBalancer   10.0.51.142    xx.xx.xxx.xxx       1433:31002/TCP      2m

NAME                             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mssql-operator   1         1         1            1           22m

NAME                                       DESIRED   CURRENT   READY     AGE
replicaset.apps/mssql-operator-7bc948fdc   1         1         1         22m

NAME                      DESIRED   CURRENT   AGE
statefulset.apps/mssql1   1         1         2m
statefulset.apps/mssql2   1         1         2m
statefulset.apps/mssql3   1         1         2m

NAME                                DESIRED   SUCCESSFUL   AGE
job.batch/mssql-initialize-mssql1   1         1            2m
job.batch/mssql-initialize-mssql2   1         1            2m
job.batch/mssql-initialize-mssql3   1         1            2m

 

After that, my AG is deployed and ready!
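As a quick smoke test, you may connect to the primary replica through the ag1-primary load balancer service. This is only a sketch: it assumes the sqlcmd client is installed on your workstation, and the placeholders must be replaced with the external IP reported by kubectl and the sa password stored in the sql-secrets object:

# retrieve the external IP of the ag1-primary service (masked in the output above)
$ kubectl get service ag1-primary -n ci -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# connect to the current primary and list the replicas and their roles
$ sqlcmd -S <external_ip>,1433 -U sa -P '<sa_password>' -Q "SELECT r.replica_server_name, s.role_desc FROM sys.dm_hadr_availability_replica_states s JOIN sys.availability_replicas r ON s.replica_id = r.replica_id"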

Let me know if you want to try it and feel free to comment! Apologies that I don’t use a GitHub repository so far, but things may be changing this year :)

Bear in mind that SQL Server 2019 is still a CTP version at the time of this write-up. Things may change more or less before the first GA …

Happy AG deployment on K8s!

The article Deploying SQL Server 2019 AGs on K8s with helm charts appeared first on Blog dbi services.

Oracle OpenWorld Europe: London 2019

Yann Neuhaus - Thu, 2019-01-17 06:35

Oracle 19c, Oracle Cloud in numbers, pre-built environments on Vagrant and Docker, and the Oracle Linux Cloud Native Environment: those were some of the topics at the OpenWorld Europe conference in London. To see what’s behind them, with links to detailed sources, please read on.

The conference was organised by Oracle; most of the speakers were Oracle employees who introduced the audience to the Oracle ecosphere at a high level. To get an overview of the (huge) portfolio Oracle offers to the market and to get in touch with Oracle employees, OpenWorld conferences are the place to be. Many statements had already been given at the bigger sister conference in San Francisco in October 2018, so European customers are the target audience for the conference in London. Most information given on upcoming features falls under the safe harbour statement, so one should be careful about taking decisions based on it.

Oracle 19c

The main target of release 19 is stability, so fewer new features were added than in previous releases. Many new features are very well described in a post by our former dbi colleague Franck Pachot.

To get a deeper view into new features of every RDBMS release, a good source is to read the new features guides:

livesql.oracle.com is now running on Oracle 19; two demos of SQL functions introduced in 19c are available:

If you would like to test Oracle 19c, you can participate in the Oracle 19c beta program.

Oracle Cloud in numbers
  • 29’000’000+ active users
  • 25’000 customers
  • 1’075 PB storage
  • 83’000 VMs at 27 data centers
  • 1’600 operators

Reading these numbers, it’s obvious that Oracle is gaining know-how in cloud environments and is also getting a better understanding of the requirements for building up private cloud environments at customer sites. It would be interesting to see what Oracle offers to small and medium-sized companies.

Oracle Linux Cloud Native environment

Building up a stack with tools for DevOps teams can be very challenging for organizations:

  • huge effort
  • hard to find expert resources
  • no enterprise support
  • complex architectural bets

That’s why Oracle built up a stack on Oracle Linux that can be used for dev, test and production environments. Some features are:

The stack can be run in the cloud as well as on premises using Oracle Virtualbox. Since release 6, Virtualbox is able to move VMs to the Oracle Cloud.

Pre-built, ready-to-use environments on VirtualBox and Docker

It’s good practice to use Vagrant for fast VirtualBox provisioning. There are a couple of pre-built, so-called “Vagrant boxes” made available by Oracle in their yum and GitHub repositories.

If you want to test on pre-built Oracle database environments (single instance, Real Application Clusters, Data Guard), Tim Hall provides Vagrant boxes for various releases.
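If you have never used Vagrant before, the workflow boils down to a handful of commands. The repository URL and box directory below are only an example of the layout at the time of writing, so check them against the repository you actually clone:

# clone a repository containing pre-built Vagrant definitions (example path)
$ git clone https://github.com/oracle/vagrant-boxes.git
$ cd vagrant-boxes/OracleDatabase/18.3.0
# provision the VM, connect to it, and destroy it when you are done
$ vagrant up
$ vagrant ssh
$ vagrant destroy -f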

If you are looking for pre-built Docker containers, have a look at the Oracle Container Registry.
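Pulling from the registry is a two-step process: you first accept the license terms for the image with your Oracle account on the registry website, then authenticate and pull. The image path and tag below are just an example; browse the registry for the exact one you need:

# log in with the Oracle account that accepted the license terms on the website
$ docker login container-registry.oracle.com
# pull a pre-built database image (example path and tag)
$ docker pull container-registry.oracle.com/database/enterprise:12.2.0.1
# start a throw-away container with the default configuration
$ docker run -d --name oracledb -p 1521:1521 container-registry.oracle.com/database/enterprise:12.2.0.1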

Oracle strategy

Providing a pre-built stack follows a broader Oracle strategy: IT professionals should not deal with basic work (provisioning, patching, basic tuning), but concentrate on other, more important tasks. That’s why Oracle offers engineered systems and cloud services as a basis. What those more important subjects are was explained in a session about “the changing role of the DBA”.

Architecture
Security
  • No insecure passwords
  • Concept work: who should have access to what and in which context?
  • Analyse privileges with the help of DBMS_PRIVILEGE_CAPTURE package
  • Data masking/redaction in test and dev environment
Availability
Understand SQL

A personal highlight was the session from Chris R. Saxon, who is a specialist in SQL. His presentation style is not only very entertaining, but the content is also interesting and helps you get the most out of the Oracle Database engine. In his session, Chris explained why SQL queries tend not to use indexes even when they are present and do full table scans instead; whether that is a problem depends mainly on the clustering factor and data cardinality. You can follow his presentation on YouTube:

You can find more video content from Chris on his Youtube channel.

If you are interested in learning SQL, another great source is the Oracle SQL blog.

The article Oracle OpenWorld Europe: London 2019 appeared first on Blog dbi services.

Hint Reports

Jonathan Lewis - Thu, 2019-01-17 03:59

Nigel Bayliss has posted a note about a frequently requested feature that has now appeared in Oracle 19c – a mechanism to help people understand what has happened to their hints.  It’s very easy to use, it’s just another format option to the “display_xxx()” calls in dbms_xplan; so I thought I’d run up a little demonstration (using an example I first generated 18 years and 11 versions ago) to make three points: first, to show the sort of report you get, second to show you that the report may tell you what has happened, but that doesn’t necessarily tell you why it has happened, and third to remind you that you should have stopped using the /*+ ordered */ hint 18 years ago.

I’ve run the following code on livesql:


rem
rem     Script:         c_ignorehint.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2001
rem


drop table ignore_1;
drop table ignore_2;

create table ignore_1
nologging
as
select
        rownum          id,
        rownum          val,
        rpad('x',500)   padding
from    all_objects
where   rownum <= 3000
;

create table ignore_2
nologging
as
select
        rownum          id,
        rownum          val,
        rpad('x',500)   padding
from    all_objects
where   rownum <= 500
;

alter table ignore_2
add constraint ig2_pk primary key (id);


explain plan for
update
        (
                select
                        /*+
                                ordered
                                use_nl(i2)
                                index(i2,ig2_pk)
                        */
                        i1.val  val1,
                        i2.val  val2
                from
                        ignore_1        i1,
                        ignore_2        i2
                where
                        i2.id = i1.id
                and     i1.val <= 10
        )
set     val1 = val2
;

select * from table(dbms_xplan.display(null,null,'hint_report'));

explain plan for
update
        (
                select
                        /*+
                                use_nl(i2)
                                index(i2,ig2_pk)
                        */
                        i1.val  val1,
                        i2.val  val2
                from
                        ignore_1        i1,
                        ignore_2        i2
                where
                        i2.id = i1.id
                and     i1.val <= 10
        )
set     val1 = val2
;

select * from table(dbms_xplan.display(null,null,'hint_report'));

As you can see I’ve simply added the format option “hint_report” to the call to dbms_xplan.display(). Before showing you the output I’ll just say a few words about the plans we might expect from the two versions of the update statement.

Given the /*+ ordered */ hint in the first statement we might expect Oracle to do a full tablescan of ignore_1 then do a nested loop into ignore_2 (obeying the use_nl() hint) using the (hinted) ig2_pk index. In the second version of the statement, and in the absence of the ordered hint, it’s possible that the optimizer will still use the same path but, in principle, it might find some other path.

So what do we get ? In order here are the two execution plans:


Plan hash value: 3679612214
 
--------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                      |          |    10 |   160 |   111   (0)| 00:00:01 |
|   1 |  UPDATE                               | IGNORE_1 |       |       |            |          |
|*  2 |   HASH JOIN                           |          |    10 |   160 |   111   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| IGNORE_2 |   500 |  4000 |    37   (0)| 00:00:01 |
|   4 |     INDEX FULL SCAN                   | IG2_PK   |   500 |       |     1   (0)| 00:00:01 |
|*  5 |    TABLE ACCESS STORAGE FULL          | IGNORE_1 |    10 |    80 |    74   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("I2"."ID"="I1"."ID")
   5 - storage("I1"."VAL"<=10)
       filter("I1"."VAL"<=10)
 
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 3 (U - Unused (1))
---------------------------------------------------------------------------
   1 -  SEL$DA9F4B51
           -  ordered
 
   3 -  SEL$DA9F4B51 / I2@SEL$1
         U -  use_nl(i2)
           -  index(i2,ig2_pk)




Plan hash value: 1232653668
 
------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT              |          |    10 |   160 |    76   (0)| 00:00:01 |
|   1 |  UPDATE                       | IGNORE_1 |       |       |            |          |
|   2 |   NESTED LOOPS                |          |    10 |   160 |    76   (0)| 00:00:01 |
|   3 |    NESTED LOOPS               |          |    10 |   160 |    76   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS STORAGE FULL | IGNORE_1 |    10 |    80 |    74   (0)| 00:00:01 |
|*  5 |     INDEX UNIQUE SCAN         | IG2_PK   |     1 |       |     0   (0)| 00:00:01 |
|   6 |    TABLE ACCESS BY INDEX ROWID| IGNORE_2 |     1 |     8 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - storage("I1"."VAL"<=10)
       filter("I1"."VAL"<=10)
   5 - access("I2"."ID"="I1"."ID")
 
Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   5 -  SEL$DA9F4B51 / I2@SEL$1
           -  index(i2,ig2_pk)
           -  use_nl(i2)

As you can see, the “Hint Report” shows us how many hints have been seen in the SQL text, then the body of the report shows us which query block, operation and table (where relevant) each hint has been associated with, and whether it has been used or not.

The second query has followed exactly the plan I predicted for the first query and the report has shown us that Oracle noted, and used, the use_nl() and index() hints to access table ignore2, deciding for itself to visit the tables in the order ignore_1 -> ignore_2, and doing a full tablescan on ignore_1.

The first query reports three hints, but flags the use_nl() hint as unused. (There is (at least) one other flag that could appear against a hint – “E” for error (probably syntax error), so we can assume that this hint is not being ignored because there’s something wrong with it.) Strangely the report tells us that the optimizer has used the ordered hint but we can see from the plan that the tables appear to be in the opposite order to the order we specified in the from clause, and the chosen order has forced the optimizer into using an index full scan on ig2_pk because it had to obey our index() hint.  Bottom line – the optimizer has managed to find a more costly plan by “using but apparently ignoring” a hint that described the cheaper plan that we would have got if we hadn’t used the hint.

Explanation

Query transformation can really mess things up and you shouldn’t be using the ordered hint.

I’ve explained many times over the years that the optimizer evaluates the cost of an update statement by calculating the cost of selecting the rowids of the rows to be updated. In this case, which uses an updatable join view, the steps taken to follow this mechanism are slightly more complex.  Here are two small but critical extracts from the 10053 trace file (taken from an 18c instance):


CVM:   Merging SPJ view SEL$1 (#0) into UPD$1 (#0)
Registered qb: SEL$DA9F4B51 0x9c9966e8 (VIEW MERGE UPD$1; SEL$1; UPD$1)

...

SQE: Trying SQ elimination.
Query after View Removal
******* UNPARSED QUERY IS *******
SELECT
        /*+ ORDERED INDEX ("I2" "IG2_PK") USE_NL ("I2") */
        0
FROM    "TEST_USER"."IGNORE_2" "I2",
        "TEST_USER"."IGNORE_1" "I1"
WHERE   "I2"."ID"="I1"."ID"
AND     "I1"."VAL"<=10


The optimizer has merged the UPDATE query block with the SELECT query block to produce a select statement that will produce the necessary plan (I had thought that i1.rowid would appear in the select list, but the ‘0’ will do for costing purposes). Notice that the hints have been preserved as the update and select were merged but, unfortunately, the merge mechanism has reversed the order of the tables in the from clause. So the optimizer has messed up our select statement, then obeyed the original ordered hint!

Bottom line – the hint report is likely to be very helpful in most cases but you will still have to think about what it is telling you, and you may still have to look at the occasional 10053 to understand why the report is showing you puzzling results. You should also stop using a hint that was replaced by a far superior hint more than 18 years ago – the ordered hint in my example should have been changed to /*+ leading(i1 i2) */ in Oracle 9i.

Why we need httpd service running on 1st node of Oracle BDA (Big Data Appliance)?

Alejandro Vargas - Thu, 2019-01-17 03:37

Whenever we run a BDA upgrade, we see as one of the prerequisite checks that the httpd service must be up and running on the first node of the cluster:

- The httpd service must be started on Node 1 of the cluster. It should be stopped on all other nodes.

This is only needed during cluster upgrade or reconfiguration, but it is recommended to have it running at all times.

The reason is that we will run yum from every node in the cluster, pointing to the repository that is created on node 1 when we deploy the bundle with the new release.

On /etc/yum.repos.d/bda.repo we can see that baseurl is using the first node and http protocol:

-  baseurl=http://<node01.domain>/bda

The repository is used in the event of upgrades and patching, and during these operations the httpd service must be up on node 1 of the cluster.

During normal operation, whenever an upgrade or patching is not being run, the httpd service can be brought down if desired.
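In practice, a quick sanity check before launching an upgrade could look like the following (standard Linux service commands; on an OL7-based BDA you would use systemctl instead of service, and <node01.domain> is the name used in the bda.repo baseurl):

# on node 1: make sure the repository is being served over http
$ service httpd status
$ service httpd start
# from any node: verify that the baseurl defined in /etc/yum.repos.d/bda.repo answers
$ curl -sI http://<node01.domain>/bda/
$ yum repolist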

 

Reference:

MOS Upgrade documents

e.g.:

Upgrading Oracle Big Data Appliance(BDA) CDH Cluster to V4.13.0 from V4.9, V4.10, V4.11, V4.12 on OL6 and from V4.12 on OL7 Releases using Mammoth Software Upgrade Bundle for OL6 or OL7 (Doc ID 2451940.1)
Categories: DBA Blogs

Study: 89 Percent of Finance Teams Yet to Embrace Artificial Intelligence

Oracle Press Releases - Thu, 2019-01-17 02:00
Press Release
Study: 89 Percent of Finance Teams Yet to Embrace Artificial Intelligence Report by Association of International Certified Professional Accountants and Oracle shows 90 percent of finance teams do not have the skills to support digital transformation

Oracle OpenWorld London, United Kingdom —Jan 17, 2019

Finance teams lack the digital skillset to embrace the latest advancements in artificial intelligence, causing a negative impact on revenue growth, according to a new study from the Association of International Certified Professional Accountants (the Association) and Oracle (NYSE: ORCL). The study of more than 700 global finance leaders found that despite a clear correlation between the deployments of AI and revenue growth, 89 percent of organizations have not deployed AI in the finance function and only 10 percent of finance teams believe they have the skills to support the organization’s digital ambitions.

The report, titled “Agile Finance Unleashed: The Key Traits of Digital Finance Leaders,” highlights that 46 percent of tech-savvy finance leaders report positive revenue growth, compared with only 29 percent of tech-challenged leaders. Furthermore, organizations that have seen revenue growth are more likely to be deploying artificial intelligence compared to those where revenues are flat or declining. However, only 11 percent of finance leaders surveyed have implemented artificial intelligence in the finance function, and 90 percent say their finance team does not have skills to support enterprise digital transformation.

“For me, robotic process automation, advanced analytics, and machine learning are three legs of the same stool,” said John Merino, chief accounting officer at FedEx. “The combination of those technologies and the ability to deliver them in an agile manner without long lead times and extensive interface complexities creates a tremendous opportunity to capitalize on some really big efficiency gains in virtually every staff function. The big win for us is to liberate that time and move finance up the value chain in what it delivers to the organization.”

“Businesses are missing out on huge growth potential by failing to give finance teams the tools and training they need to make better corporate decisions,” said Andrew Harding FCMA, CGMA, chief executive of management accounting at the Association of International Certified Professional Accountants. “Cloud and emerging technologies like AI and blockchain drive efficiency and improve insight and accuracy, enabling finance leaders to step into a more strategic role in the business and improve the organization’s data-driven decision making. To make the most of these new technologies, finance teams need to simultaneously evolve the competencies of their staff in areas such as analytical thinking, decision-making and business partnering.”

Key Traits of Digital Finance Leaders: The report identifies three common traits of tech-savvy finance teams:

  • Modern Business Processes: According to the report, tech-savvy finance leaders use advanced technologies and establish ‘operational excellence’. For example, 86 percent of Digital Finance Leaders have a digital-first and cloud-first mindset, which gives them greater access to intelligent process automation and technologies such as artificial intelligence and Blockchain, which are commonly delivered via the cloud. Additionally, 73 percent centralize finance subject matter expertise in a global ‘Center of Excellence.’
  • Data Insights: Leading finance teams are able to connect data that was previously in disparate applications to uncover new insights. They are increasingly relying on AI to uncover hidden patterns, make recommendations, and learn continually from the non-stop flow of business data. The report shows that organizations that have seen positive revenue growth are more likely to be deploying AI compared to those where revenues are flat or have declined.
  • Business Influence: Leading finance teams have been able to move beyond reporting and are using data-driven insights to influence the direction of the business. With reduced time spent on manual reporting processes and armed with accurate and timely data, finance leaders are empowered to partner with the business, recommend new courses of action and influence business strategy.

“The cloud has significantly reduced the barrier to emerging technologies and is enabling organizations to introduce new business models and unique customer experiences that drive additional revenue streams,” says Kimberly Ellison-Taylor CPA, CGMA, global strategy leader, Cloud Business Group, Oracle and former chair of the Association and the American Institute of CPAs (AICPA). “Common benefits that our customers experience once in the cloud include reduced costs and improved efficiency, increased security, real-time and accurate reporting, deeper business insight and better decision-making. The confluence of benefits enables organizations to spend less time on low-value, time-intensive reporting and innovate faster than their competitors.”

The Association conducted in-depth interviews with CFOs and other top finance executives at companies around the globe. Included in that extensive and diverse group of participants were the following Oracle customers:

  • Addiko Bank
  • Blue Cross Blue Shield of Michigan
  • Highmark Health
  • Hungry Jacks
  • Instacart
  • RedMart
  • Rolls Royce
  • Royal Bank of Scotland
  • Stitch Fix
  • USEN Corporation
  • Western Digital

The full report can be downloaded at http://www.oracle.com/goto/agilefinance.

Contact Info
Bill Rundle
Oracle PR
1.650.506.1891
bill.rundle@oracle.com
Colette Sharbaugh
Association PR
1.202.434.9212
colette.sharbaugh@aicpa-cima.com
About the Association of International Certified Professional Accountants

The Association of International Certified Professional Accountants (the Association) is the most influential body of professional accountants, combining the strengths of the American Institute of CPAs (AICPA) and the Chartered Institute of Management Accountants (CIMA) to power opportunity, trust and prosperity for people, businesses and economies worldwide. It represents 667,000 members and students in public and management accounting and advocates for the public interest and business sustainability on current and emerging issues. With broad reach, rigor and resources, the Association advances the reputation, employability and quality of CPAs, CGMA designation holders and accounting and finance professionals globally.

About Oracle

The Oracle Cloud offers complete SaaS application suites for ERP, HCM and CX, plus best-in-class database Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) from data centers throughout the Americas, Europe and Asia. For more information about Oracle (NYSE:ORCL), please visit us at www.oracle.com.

Trademarks

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Talk to a Press Contact

Bill Rundle

  • 1.650.506.1891

Colette Sharbaugh

  • 1.202.434.9212

Eno Future Proofs Oracle Health Insurance Back Office

Oracle Press Releases - Wed, 2019-01-16 12:38
Press Release
Eno Future Proofs Oracle Health Insurance Back Office Health insurer poised to provide customers with optimal service while creating internal efficiencies at a lower cost

Redwood Shores, Calif.—Jan 16, 2019

Utrecht, 16 January 2019 - Health insurer Eno, based in the Netherlands, is switching to Oracle Health Insurance Back Office (OHI), a preferred solution in the market that will provide critical flexibility to its healthcare IT infrastructure.

Eno sought a replacement of its outdated back-office system that would meet its own requirements and those of regulators, while adding strategic value to the organization. It also needed a stable and future-proof solution that could scale with a number of products in its own special care product portfolio. Eno considered various options including custom development and the use of existing COTS solutions, but only OHI met Eno's overall requirements, and is a proven solution in the market to administer policies, process claims and financial settlement.

Theo Engels, director ICT and Care at Eno, noted, "To continue to be successful as a health insurer in the future, Eno wants to transform into an organisation with a digital heart. We want to adapt processes and systems to the requirements of our time and to focus on an optimal digital customer experience, both in sales and services.”

"With Oracle Health Insurance we offer Eno not only a complete, flexible and future-proof application that enables them to comply with legislation and regulations, but also enables an intensive collaboration with other OHI users that leads to joint value creation and a lower total cost of ownership," says Jaco Geels, OHI Senior Director Product Strategy and Services, Oracle.

Eno has selected Truston Solutions, a Gold level member of the Oracle PartnerNetwork (OPN), for the implementation of Oracle Health Insurance. Eno intends to start using the new platform in the 2019/2020 migration season.

Contact Info
Judi Palmer
Oracle
+1 650 506 0266
judi.palmer@oracle.com
About Eno

Eno is a health insurer based in Deventer and with a strong regional position. The company, founded in 1860, strives every day to make a substantial contribution to the health of its members. Eno does this by providing them—now and in the future—with appropriate, affordable and accessible care and by inspiring and facilitating them in their pursuit of optimal health.

About Oracle Insurance

Oracle provides modern, innovative technology that enables insurers to drive their digital transformation strategy forward. With Oracle’s flexible, rules-based solutions, insurers can simplify IT by consolidating individual and group business on a single platform, quickly deploy differentiated products, and empower their ecosystem to improve engagement with customers and partners. Learn more at oracle.com/insurance.

About Oracle PartnerNetwork

Oracle PartnerNetwork (OPN) is Oracle’s partner program that provides partners with a differentiated advantage to develop, sell and implement Oracle solutions. OPN offers resources to train and support specialized knowledge of Oracle’s products and solutions and has evolved to recognize Oracle’s growing product portfolio, partner base and business opportunity. Key to the latest enhancements to OPN is the ability for partners to be recognized and rewarded for their investment in Oracle Cloud. Partners engaging with Oracle will be able to differentiate their Oracle Cloud expertise and success with customers through the OPN Cloud program – an innovative program that complements existing OPN program levels with tiers of recognition and progressive benefits for partners working with Oracle Cloud. To find out more visit: http://www.oracle.com/partners.

Trademarks

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Talk to a Press Contact

Judi Palmer

  • +1 650 506 0266

Critical Patch Update for January 2019 Now Available

Steven Chan - Wed, 2019-01-16 12:08

The Critical Patch Update (CPU) for January 2019 was released on 15 January 2019. Oracle strongly recommends applying the CPU patches as soon as possible.

New MOS Note for EBS CPUs

To make it easier for you to find the latest available Oracle E-Business Suite CPU document, we published a new document, Identifying the Latest Critical Patch Update for Oracle E-Business Suite Release 12 (Doc ID 2484000.1).  This new document will always contain a link to the latest available Oracle E-Business Suite CPU document. We recommend you bookmark My Oracle Support Knowledge Document 2484000.1 in your favorites list in My Oracle Support. 

    Oracle CPU Announcements

The Critical Patch Update Advisory is the starting point for reviewing all Oracle Critical Patch Update Advisories, Security Alerts and Bulletins. It includes a list of products affected, risk matrices of fixed vulnerabilities, and links to other important documents.

It is essential to review the Critical Patch Update supporting documentation referenced in the Advisory before applying patches.

The next four Critical Patch Update release dates are:

  • 16 April 2019
  • 16 July 2019
  • 15 October 2019
  • 14 January 2020
Categories: APPS Blogs

Oracle Java Card Boosts Security for IoT Devices at the Edge

Oracle Press Releases - Wed, 2019-01-16 08:00
Press Release
Oracle Java Card Boosts Security for IoT Devices at the Edge Java Card 3.1 enables multi-cloud authentication and quickly deployed security to connected devices

Redwood Shores, Calif.—Jan 16, 2019

Oracle today announced the general availability of Java Card 3.1, the latest version of one of the world’s most popular open application platforms, used to secure some of the world’s most sensitive devices. This extensive update provides more flexibility to help meet the unique hardware and security requirements of both existing secure chips and emerging Internet of Things (IoT) technologies. New features introduced with this release address use cases across markets ranging from telecom and payments to cars and wearables.

Java Card technology provides a secured environment for applications that run on smart cards and other trusted devices with limited memory and processing capabilities. With close to six billion Java Card-based devices deployed each year, Java Card is already a leading software platform to run security services on smart cards and secure elements, which are chips used to protect smartphones, banking cards and government services.

Java Card introduces features that make applications more portable across security hardware critical to IoT. This enables new uses for hardware-based security, such as multi-cloud IoT security models, and makes Java Card the ideal solution for tens of billions of IoT devices that require security at the edge of the network.

Emerging Applications for Java Card include:

  • Smart meters and industrial IoT – Increasingly sophisticated IoT smart meters and IoT gateways use Java Card to authenticate smart city and corporate services while protecting individual device credentials.
  • Wearables – Wearable and consumer electronics are increasingly used for sensitive applications such as Near Field Communication (NFC) ticketing and payments, as well as tracking health data. Java Card helps to meet the security requirements of these devices while allowing the flexibility to add and update services.
  • Automotive – Car manufacturers can use strong Java Card-based security to help protect vehicle systems and sensitive data from physical and network attacks.
  • Cloud connected devices – Java Card in connected devices can enable access to 5G or NBIoT networks and offer strong authentication for the IoT cloud.
 

“Connected devices’ volumes are expected to increase in the upcoming years, posing an increasingly complex challenge as growth adds system complexity to the infrastructure handling device data,” said Volker Gerstenberger, President and Chair of the Java Card Forum. “Java Card 3.1 is very significant to the Internet of Things, bringing interoperability, security and flexibility to a fast-growing market currently lacking high-security and flexible edge security solutions.”

New features and capabilities include:

  • Deployment of edge security services at IoT speed – Java Card 3.1 allows the development of security services that are portable across a wide range of IoT security hardware, helping reduce the risk and complexity of evolving IoT hardware and standards. A new extensible I/O model enables applications to exchange sensitive data directly with connected peripherals, over a variety of physical layers and application protocols.
  • Dedicated IoT features – Java Card 3.1 introduces new APIs and updated cryptography functions to help address the security needs of IoT and facilitate the design of security applications such as device attestation. Uniquely, Java Card in IoT devices enables deployment of security and connectivity services on the same chip. Multiple applications can be deployed on a single card and new ones can be added to it even after it has been deployed.
  • Developer enhancements – Java Card includes a set of unique tools for developing new services and applications. An extended file format simplifies application deployment, code upgrade and maintenance. API enhancements boost developer productivity and the memory efficiency of applications in secure devices.
 

“Java Card is already used and trusted as a leading security platform for countless devices in the multi-billion-dollar smart card and secure element industry,” said Florian Tournier, Senior Director for Java Card at Oracle. “The 3.1 release enables the rollout of security and SIM applications on the same chip, allowing those services to be used on a large spectrum of networks from NB-IoT to 5G, and on a wide range of devices.”

To learn more about Java Card technology, please visit https://www.oracle.com/technetwork/java/embedded/javacard/overview/index.html

Contact Info
Alex Shapiro
Oracle PR
+1-650-607-1591
alex.shapiro@oracle.com
About Oracle

The Oracle Cloud offers complete SaaS application suites for ERP, HCM and CX, plus best-in-class database Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) from data centers throughout the Americas, Europe and Asia. For more information about Oracle, please visit us at www.oracle.com.

Trademarks

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Talk to a Press Contact

Alex Shapiro

  • +1-650-607-1591

[Vlog] How to Connect to Autonomous Database on Oracle Cloud

Online Apps DBA - Wed, 2019-01-16 04:45

Autonomous databases are preconfigured, fully-managed environments that are suitable for either transaction processing or for data warehouse workloads. Want to know more? Visit: https://k21academy/clouddba40 and consider our new video blog on Must Know Things in order to connect to Autonomous Database on Oracle Cloud and… ✔ Database Service in Oracle Cloud Infrastructure ✔ Steps To […]
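As a generic illustration of what such a connection typically looks like from a command line (this is only a sketch, not the exact steps from the video: the wallet file name, target directory and service name are assumptions), downloading the client credentials wallet from the OCI console and pointing your client at it is usually enough:

# unzip the client credentials wallet downloaded from the OCI console (example file name)
$ unzip Wallet_demoadb.zip -d /opt/oracle/wallet
# point the client at the wallet; edit sqlnet.ora so the WALLET_LOCATION directory matches this folder
$ export TNS_ADMIN=/opt/oracle/wallet
# connect with one of the predefined services from tnsnames.ora (high/medium/low)
$ sqlplus admin@demoadb_high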

The post [Vlog] How to Connect to Autonomous Database on Oracle Cloud appeared first on Oracle Trainings for Apps & Fusion DBA.

Categories: APPS Blogs

Podcast: Database Golden Rules: When (and Why) to Break Them

OTN TechBlog - Tue, 2019-01-15 23:00

American inventor Thomas Edison once said, “Hell, there are no rules here. We're trying to accomplish something.”

What we hope to accomplish with this episode of the Groundbreaker Podcast is an exploration of the idea that the evolution in today’s architectures makes it advantageous, perhaps even necessary, to challenge some long-established concepts that have achieved “golden rule” status as they apply to the use of databases.

Does ACID (Atomicity, Consistency, Isolation, and Durability) still carry as much weight? In today’s environments, how much do rigorous data integrity enforcement, data normalization, and data freshness matter? This program explores those and other questions.

Bringing their insight and expertise to this discussion are three recognized IT professionals who regularly wrestle with balancing the rules with innovation. If you’ve struggled with that same balancing act, you’ll want to listen to this program.

The Panelists (listed alphabetically)

Heli Helskyaho
CEO, Miracle Finland Oy, Finland
Oracle ACE Director

Lucas Jellema
CTO, Consulting IT Architect, AMIS Services, Netherlands
Oracle ACE Director
Oracle Groundbreaker Ambassador

Guido Schmutz
Principal Consultant, Technology Manager, Trivadis, Switzerland
Oracle ACE Director
Oracle Groundbreaker Ambassador

 

Additional Resources
Coming Soon
  • Baruch Sadogursky, Leonid Igolnik, and Viktor Gamov discuss DevOps, streaming, liquid software, and observability in this podcast captured during Oracle Code One 2018
  • What's Up with Serverless? A panel discussion of where Serverless fits in the IT landscape.
  • JavaScript Development and Oracle JET. A discussion of the evolution of JavaScript development and the latest features in Oracle JET.
Subscribe
Never miss an episode! The Oracle Groundbreakers Podcast is available via:
Participate

If you have a topic suggestion for the Oracle Groundbreakers Podcast, or if you are interested in participating as a panelist, please let me know by posting a comment. I'll get back to you right away.
