LP Stage 2

In this stage, we will be adding OCR processing of uploaded pages, and allowing the BookReader to search the pages of the current book for text.  The tesseract program will be used for the OCR process.

You can download a backup image of the Stage 2 system here (you will need to run sudo raspi-config to regain SD card space):

You can download the Stage2 PHP files here:

Please note that the image above was taken after the sample book was uploaded and while it was about halfway through OCR processing.  This way you can confirm the scheduling is working correctly as the book eventually completes.

SQL script for stage 2

 

Install Tesseract

Tesseract is the open source OCR package we will be using.  It’s another one-line install:

sudo apt-get install tesseract-ocr

To test it, download our sample file:

wget http://www.librarypi.com/downloads/Sample.jpg

Execute the following command to test tesseract:

tesseract Sample.jpg Sample

This may take several minutes to complete, but eventually output should appear like this:

tesseract1

This should have output a Sample.txt file, so look at the top of that file:

head Sample.txt

tesseract2

Now, compare that text with the actual Sample.jpg file, and we see tesseract did a decent job:

tesseract3

MySQL Changes

We need to add two new tables to our MySQL database to support searching the words on the pages: an lp_word table to hold unique words, and an lp_page_word table to hold a list of the words on each page along with each word’s position.  The tables are defined as follows:

create table if not exists lp_word(
 id integer  NOT NULL AUTO_INCREMENT,
 word varchar(32),
 PRIMARY KEY PK_lp_word (id),
 INDEX ilp_word (word)
);
create table if not exists lp_page_word(
 id integer  NOT NULL AUTO_INCREMENT,
 page_id integer,
 word_id integer,
 seq integer,
 posleft integer,
 postop integer,
 posright integer,
 posbottom integer,
 PRIMARY KEY PK_lp_page_word (id),
 INDEX ilp_page_word_page (page_id),
 INDEX ilp_page_word_word (word_id)
);

Using the mysql command-line client or MySQL Workbench, execute the statements above in your database.

lp.php – Add OCR data to database

We need to modify the PHP ‘LP’ class to add the ability to OCR text using tesseract and store the resulting words in the database for searching.  We’re also going to add methods that help the BookReader support text searching.

Method OCRFiles
This method searches the lp_page table for a record with ‘N’ status.  It then invokes tesseract on that file to create the hocr file, and sets the page’s status to ‘O’, indicating it was OCR’ed.  Each call processes at most one page.

 function OCRFiles( ) 
 {
  $Ret = 0;
  $Cmd = 'select p.id, filename '.
   ' from lp_page p '.
   ' where status = \'N\'';
  $result = mysqli_query( db(), $Cmd );
  if ( $row = mysqli_fetch_array( $result ) )
  {
   $ID = $row["id"];
   $Filename = $row["filename"];
   // Create same file with .hocr extension appended
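   exec( "tesseract ".escapeshellarg( $Filename )." ".escapeshellarg( $Filename )." hocr" ); // Quote the path so spaces or shell characters are safe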
   if( file_exists( $Filename.".hocr" ) )
   { // Update status to 'O' (for OCR'ed)
      mysqli_query( db(), "update lp_page set status='O' where id = $ID" );
      $Ret++;
   }
  }
  $result->close();
  return $Ret;   
 }

Method AddOCRRecords
This method looks for lp_page records with ‘O’ status.  It then loads the corresponding hocr file for the image and parses it into words.  It calls GetWordID to add each word to the database if needed, records its position on the page in the lp_page_word table, and finally sets the page’s status to ‘I’.

 function AddOCRRecords()
 {
  $Ret = 0;
  $Cmd = 'select id, filename '.
   ' from lp_page '.
   ' where status = \'O\'';
  $result = mysqli_query( db(), $Cmd );
  $Max = 10;
  while ( $row = mysqli_fetch_array( $result ) )
  {
   // Cap the work done per call (the loop stops on the 10th row)
   $Max--;
   if( $Max == 0 )
    break;
   $PageID = $row["id"];
   $Filename = $row["filename"].'.hocr';
   $Text = file_get_contents( $Filename );
   $Seq = 0;
   // Walk through every word span in the hocr file
   while( ($P = strpos( $Text, "<span class='ocrx_word'" )) !== FALSE )
   {
    $Text = substr( $Text, $P+23 ); // Skip past the start of the span tag
    $P = strpos( $Text, "</span>" );
    $Word = substr( $Text, 0, $P );
    $P = strpos( $Word, 'bbox' );
    $Word = substr( $Word, $P+4 );
    $P = strpos( $Word, ';' );
    $Dim = substr( $Word, 0, $P ); // " left top right bottom"
    $P = strpos( $Word, '>' );
    $Word = strip_tags( substr( $Word, $P+1 ) );
    $WordID = $this->GetWordID( $Word );
    if( $WordID > 0 )
    {
     $Seq++;
     $ar = explode( ' ', $Dim ); // $ar[0] is empty because of the leading space
     $Cmd = "insert into lp_page_word ( page_id, word_id, seq, posleft, postop, posright, posbottom ) values (".
      "$PageID, $WordID, $Seq, ".$ar[1].",".$ar[2].",".$ar[3].",".$ar[4].")"; // LTRB
     mysqli_query( db(), $Cmd );
    }
   }
   // Mark the page as done so it is not picked up again
   $Cmd = "update lp_page set status='I' where id = $PageID";
   mysqli_query( db(), $Cmd );
   $Ret++;
  }
  $result->close();
  return $Ret;
 }
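
To make the parsing in AddOCRRecords easier to follow, here is a small standalone sketch that pulls the word and its bounding box out of a single hocr span.  The sample markup is an assumption based on typical tesseract hocr output (the exact attributes vary between tesseract versions), and ParseHocrWord is a hypothetical helper, not part of lp.php.  It uses a regular expression rather than the strpos approach above, purely for brevity.

```php
<?php
// Hypothetical helper (not part of lp.php): extract the word text and
// bounding box from one ocrx_word span.
function ParseHocrWord( $Span )
{
    // Match "bbox L T R B" in the title attribute, then the span's content.
    if ( !preg_match( '/bbox (\d+) (\d+) (\d+) (\d+)[^>]*>(.*?)<\/span>/s', $Span, $m ) )
        return null;
    return array(
        'word'   => trim( strip_tags( $m[5] ) ),
        'left'   => (int)$m[1],
        'top'    => (int)$m[2],
        'right'  => (int)$m[3],
        'bottom' => (int)$m[4],
    );
}

// Assumed sample: one word span roughly as tesseract might emit it.
$Sample = "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'><strong>Lerner</strong></span>";
$Info = ParseHocrWord( $Sample );
echo $Info['word'], ' ', $Info['left'], ',', $Info['top'], ',', $Info['right'], ',', $Info['bottom'], "\n";
```

The four numbers map to the posleft, postop, posright and posbottom columns of lp_page_word.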

Method GetWordID
This method is called to get the ID for a word.  If not found, it automatically adds it and returns the new ID.

 function GetWordID( $Word )
 {
  $Word =  $this->ProcessWord( $Word );
  if( strlen( $Word ) == 0 )
   return 0;
  $result = mysqli_query( db(), "select * from lp_word w where w.word = '$Word' ");
  if ( $row = mysqli_fetch_array( $result ) )
   $Ret = $row["id"];
  else
  {
   mysqli_query( db(), "insert into lp_word (word) values( '$Word') ");
   $Ret  = mysqli_insert_id ( db() );
  }
  $result->close();
  return $Ret;
 }
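
GetWordID relies on ProcessWord, which is not listed in this post.  As a rough idea of what such a normalising step could do, the sketch below lowercases the word, strips everything except letters and digits, and truncates to the 32 characters that the lp_word.word column can hold.  This is an assumption for illustration only, not the actual ProcessWord from lp.php, and the name NormalizeWord is hypothetical.

```php
<?php
// Hypothetical stand-in for ProcessWord (the real method is not shown).
function NormalizeWord( $Word )
{
    $Word = strtolower( trim( $Word ) );
    // Keep only ASCII letters and digits; this also removes quotes
    // that would otherwise break the SQL string in GetWordID.
    $Word = preg_replace( '/[^a-z0-9]/', '', $Word );
    // Cast to string because substr() can return false on older PHP versions.
    return (string)substr( $Word, 0, 32 );
}

echo NormalizeWord( ' "Lerner," ' ), "\n";
```

A word made up only of punctuation comes back as an empty string, which is the case GetWordID already handles by returning 0.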

Method OutputSearchAndExit
This method will perform a search on the lp_page_word and related tables to locate words in pages.  It is used by the util.php file, which in turn is used by the BookReader component.  Together they allow the user to perform a text search on the uploaded books.

This method is not listed here, to keep the description simple. Leave a comment if you want more details.
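
For readers who want a feel for the kind of query involved, here is a minimal sketch of a word lookup across the two new tables.  It is not the actual OutputSearchAndExit code; BuildWordSearchSQL is a hypothetical helper, and it uses only the columns defined in the MySQL Changes section.

```php
<?php
// Hypothetical helper: returns a parameterised query that finds every
// occurrence of one word, with its page and bounding box.
function BuildWordSearchSQL()
{
    return 'select pw.page_id, pw.seq, pw.posleft, pw.postop, pw.posright, pw.posbottom '.
           ' from lp_page_word pw '.
           ' join lp_word w on w.id = pw.word_id '.
           ' where w.word = ? '.
           ' order by pw.page_id, pw.seq';
}

// Usage with mysqli (assumes $db is an open mysqli connection and
// $Word has been normalised the same way as when it was indexed):
//   $stmt = mysqli_prepare( $db, BuildWordSearchSQL() );
//   mysqli_stmt_bind_param( $stmt, 's', $Word );
//   mysqli_stmt_execute( $stmt );
echo BuildWordSearchSQL(), "\n";
```

The ? placeholder keeps the search term out of the SQL text, so user input from the search box cannot alter the query.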

New proc.php file

We are going to add a new file, proc.php, to our site, and set up a scheduled task to invoke this page with wget every minute.  The job of this script is to look for pages that need to be OCR’ed or indexed.  Please note that this page runs for at least 50 seconds, and quite possibly longer.  If you call it up in your web browser, you must be VERY patient, and even then it won’t output anything.  It is meant to be called by the scheduler, not normally in a browser.

<?php
 include_once( 'lp.php' );
 
 function DoProcess() {
  $lp = new LP();
  $lp->OCRFiles();
  $lp->AddOCRRecords();
 } // end of function DoProcess() 
 
    $fp = fopen( "uploads/file.flag","w");
    if (flock($fp, LOCK_EX)) {
  try {
   $StopAt = time() + 50; // Run for 50 seconds
   $Cnt = 0;
   while( $StopAt > time() )
   {
    DoProcess();
    sleep( 2 );
    $Cnt++;
   }
  } 
  catch( Exception $e ) 
  {
   echo "Exception: " . $e->getMessage();
  }
        
        flock($fp, LOCK_UN);
    } // No else needed
 fclose($fp);
?>
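
The file lock is what keeps two runs of proc.php from working on the same pages at once.  As listed, a second run waits for the lock to be released.  If you would rather have an overlapping run give up immediately, a non-blocking lock is one option.  The sketch below shows the idea in isolation; TryLock and the demo file path are hypothetical and not part of the Stage 2 files.

```php
<?php
// Hypothetical helper: try to take an exclusive lock without waiting.
// Returns the open handle on success, or false if the lock is held elsewhere.
function TryLock( $Path )
{
    $fp = fopen( $Path, 'c' ); // 'c' creates the file if needed, without truncating it
    if ( $fp === false )
        return false;
    if ( !flock( $fp, LOCK_EX | LOCK_NB ) )
    {
        fclose( $fp );
        return false;
    }
    return $fp;
}

$Path = sys_get_temp_dir() . '/lp_proc_demo.flag'; // assumed demo path
$fp = TryLock( $Path );
if ( $fp !== false )
{
    echo "Lock acquired, safe to process\n";
    flock( $fp, LOCK_UN );
    fclose( $fp );
}
else
    echo "Another run is still active, exiting\n";
```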

Starting scheduled job
In order to invoke proc.php every minute, we need to set up cron to call it.  We do this by running crontab -e to edit the cron table:

crontab -e

Then, in the file, add this line so that wget is called to launch the proc.php file:

* * * * * wget -q http://localhost/proc.php -O /dev/null

Once the schedule has started, you should see the indicator of ‘Pages to process’ slowly dropping:

proc

Note that this is still a Raspberry Pi, and it may take a minute or two for each page to process.  So be patient.

Testing Stage 2

Once all the files are in place, the crontab has been modified so that proc.php is called, and the Pi has had enough time to OCR and index a book, you should be able to search for text in the book (BookReader):

Enter text, such as Lerner, in the search field and click the Go button:

search1

You should see an animation indicating the pages that contain the text:

search2

When you go to that page, you should see the search text highlighted in blue:

search3

In Stage 3, we will add user authentication and some security, as well as the ability to delete books.