Apache PDFBox - getting java.lang.OutOfMemoryError when using split(PDDocument document)


I am trying to split a document of a decent 300 pages using the Apache PDFBox API v2.0.2. I am trying to split the PDF file into single pages using the following code:

    PDDocument document = PDDocument.load(inputFile);
    Splitter splitter = new Splitter();
    List<PDDocument> splittedDocuments = splitter.split(document); // exception happens here

I receive the following exception:

    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

This indicates that the GC is spending too much time clearing the heap, which is not justified by the amount of memory reclaimed.

There are numerous JVM tuning methods that can solve the situation; however, all of these only treat the symptom, not the real issue.

One final note: I am using JDK 6, hence using the new Java 8 Consumer is not an option in my case. Thanks.

Edit:

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because:

  1. I do not have the size problem mentioned in the aforementioned topic. I am slicing a 270-page, 13.8 MB PDF file, and after slicing, the size of each slice is 80 KB on average, with a total size of 30.7 MB.
  2. The split throws the exception before it returns the split parts.

I found that the split can pass as long as I am not passing the whole document. Instead, I pass it in "batches" of 20-30 pages each, and that does the job.

PDFBox stores the parts resulting from the split operation as objects of type PDDocument on the heap, which results in the heap getting filled fast. Even if you call close() after every round in the loop, the GC will still not be able to reclaim the heap as quickly as it gets filled.
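To see whether the heap really fills faster than it is reclaimed, heap usage can be logged between rounds. Below is a minimal sketch that uses only java.lang.Runtime. The class name HeapProbe and its methods are hypothetical helpers, not part of PDFBox, and the figures are approximate because the JVM reports them without forcing a collection. The code sticks to Java 6 syntax.

```java
/**
 * Hypothetical helper for logging approximate heap usage.
 * Not part of PDFBox; it only reads figures from java.lang.Runtime.
 */
final class HeapProbe {
    private static final long MB = 1024L * 1024L;

    private HeapProbe() {
    }

    /** Bytes currently in use on the heap (allocated minus free). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary, e.g. for printing after each split round. */
    static String describe(String label) {
        Runtime rt = Runtime.getRuntime();
        return label + ": used " + (usedBytes() / MB) + " MB of max "
                + (rt.maxMemory() / MB) + " MB";
    }
}
```

Printing HeapProbe.describe("after round " + i) at the end of each loop iteration gives a rough trend. If the used figure keeps climbing towards the maximum from round to round, that is consistent with the behaviour described above.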

An option is to break the split operation into batches, in which each batch is a relatively manageable chunk (10 to 40 pages):

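    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    import org.apache.pdfbox.multipdf.Splitter;
    import org.apache.pdfbox.pdmodel.PDDocument;

    public void execute() {
        File inputFile = new File("path/to/the/file.pdf");
        PDDocument document = null;
        try {
            document = PDDocument.load(inputFile);

            // Splitter page numbers are 1-based and inclusive at both ends,
            // so each batch starts one page after the previous batch ended.
            int start = 1;
            int end = 0;
            int batchSize = 50;
            int finalBatchSize = document.getNumberOfPages() % batchSize;
            int noOfBatches = document.getNumberOfPages() / batchSize;
            for (int i = 1; i <= noOfBatches; i++) {
                start = end + 1;
                end = start + batchSize - 1;
                System.out.println("Batch: " + i + " start: " + start + " end: " + end);
                split(document, start, end);
            }
            // Handle the remaining pages, if there are any
            if (finalBatchSize > 0) {
                start = end + 1;
                end += finalBatchSize;
                System.out.println("Final batch  start: " + start + " end: " + end);
                split(document, start, end);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the source document
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    private void split(PDDocument document, int start, int end) throws IOException {
        Splitter splitter = new Splitter();
        splitter.setStartPage(start);
        splitter.setEndPage(end);
        // By default the Splitter produces one document per page.
        List<PDDocument> splittedDocuments = splitter.split(document);
        // Config is my own configuration helper (not shown here).
        String outputPath = Config.instance.getProperty("outputPath");

        for (int index = 0; index < splittedDocuments.size(); index++) {
            PDDocument splittedDocument = splittedDocuments.get(index);
            try {
                // Name each file by its page number in the source document.
                String fileName = document.getDocumentInformation().getTitle()
                        + "-" + (start + index) + ".pdf";
                splittedDocument.save(new File(outputPath, fileName));
            } finally {
                // Release each part as soon as it has been written.
                splittedDocument.close();
            }
        }
    }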
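The page arithmetic is the part that is easiest to get wrong, because the Splitter's start and end pages are 1-based and inclusive. Below is a small sketch that computes the batch ranges without touching PDFBox, so it can be checked on its own. PageBatches and ranges are hypothetical names, and the code sticks to Java 6 syntax.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: computes 1-based, inclusive page ranges for batches.
 * Independent of PDFBox; each pair could be passed to
 * Splitter.setStartPage / Splitter.setEndPage.
 */
final class PageBatches {
    private PageBatches() {
    }

    /** Returns {start, end} pairs covering pages 1..totalPages with no overlap. */
    static List<int[]> ranges(int totalPages, int batchSize) {
        if (totalPages < 0 || batchSize < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and batchSize >= 1");
        }
        List<int[]> result = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // The last batch may be shorter than batchSize.
            int end = Math.min(start + batchSize - 1, totalPages);
            result.add(new int[] { start, end });
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] range : ranges(270, 50)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

For a 270-page document with batches of 50 this gives six ranges, from 1-50 through 251-270. A page count that is an exact multiple of the batch size produces no extra trailing batch.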
