{"id":23,"date":"2016-03-17T10:48:09","date_gmt":"2016-03-17T10:48:09","guid":{"rendered":"http:\/\/www.librarypi.com\/?page_id=23"},"modified":"2016-04-01T01:04:36","modified_gmt":"2016-04-01T01:04:36","slug":"lp-stage-2","status":"publish","type":"page","link":"https:\/\/www.librarypi.com\/index.php\/lp-stage-2\/","title":{"rendered":"LP Stage 2"},"content":{"rendered":"<p>In this stage, we will be adding OCR processing of uploaded pages, and allowing the BookReader to search the pages of the current book for text.\u00a0 The <a href=\"https:\/\/github.com\/tesseract-ocr\" target=\"_blank\">tesseract<\/a> program will be used for the OCR process.<\/p>\n<p>You can download a backup image of the Stage 2 system here (you will need to run <strong>sudo raspi-config<\/strong> to regain SD card space):<\/p>\n<div class='w3eden'><!-- WPDM Link Template: Default Template -->\n\n<div class=\"link-template-default card mb-2\">\n    <div class=\"card-body\">\n        <div class=\"media\">\n            <div class=\"mr-3 img-48\"><img decoding=\"async\" class=\"wpdm_icon\" alt=\"Icon\"   src=\"https:\/\/www.librarypi.com\/wp-content\/plugins\/download-manager\/assets\/file-type-icons\/zip.svg\" \/><\/div>\n            <div class=\"media-body\">\n                <h3 class=\"package-title\"><a href='https:\/\/www.librarypi.com\/index.php\/download\/library-pi-stage-2-image\/'>Library Pi Stage 2 Image<\/a><\/h3>\n                <div class=\"text-muted text-small\"><i class=\"fas fa-copy\"><\/i> 1 file(s) <i class=\"fas fa-hdd ml-3\"><\/i> 527 MB<\/div>\n            <\/div>\n            <div class=\"ml-3\">\n                <a class='wpdm-download-link download-on-click btn btn-primary ' rel='nofollow' href='#' data-downloadurl=\"https:\/\/www.librarypi.com\/index.php\/download\/library-pi-stage-2-image\/?wpdmdl=145&refresh=6a458b471f6b61782942535\">Download<\/a>\n            <\/div>\n        <\/div>\n    <\/div>\n<\/div>\n\n<\/div>\n<p>You can download the Stage 2 PHP files 
here:<\/p>\n<div class='w3eden'><!-- WPDM Link Template: Default Template -->\n\n<div class=\"link-template-default card mb-2\">\n    <div class=\"card-body\">\n        <div class=\"media\">\n            <div class=\"mr-3 img-48\"><img decoding=\"async\" class=\"wpdm_icon\" alt=\"Icon\" src=\"https:\/\/www.librarypi.com\/wp-content\/plugins\/download-manager\/assets\/file-type-icons\/zip.svg\" \/><\/div>\n            <div class=\"media-body\">\n                <h3 class=\"package-title\"><a href='https:\/\/www.librarypi.com\/index.php\/download\/php-files-for-stage-2\/'>PHP Files for Stage 2<\/a><\/h3>\n                <div class=\"text-muted text-small\"><i class=\"fas fa-copy\"><\/i> 1 file(s) <i class=\"fas fa-hdd ml-3\"><\/i> 312.70 KB<\/div>\n            <\/div>\n            <div class=\"ml-3\">\n                <a class='wpdm-download-link download-on-click btn btn-primary ' rel='nofollow' href='#' data-downloadurl=\"https:\/\/www.librarypi.com\/index.php\/download\/php-files-for-stage-2\/?wpdmdl=142&refresh=6a458b4720c7a1782942535\">Download<\/a>\n            <\/div>\n        <\/div>\n    <\/div>\n<\/div>\n\n<\/div>\n<p>Please note that the image above was taken when the sample book was uploaded and about half done OCR&#8217;ing.\u00a0 This way you can confirm the scheduling is all working correctly as the book eventually completes.<\/p>\n<p>SQL script for stage 2<br \/>\n<div class='w3eden'><!-- WPDM Link Template: Default Template -->\n\n<div class=\"link-template-default card mb-2\">\n    <div class=\"card-body\">\n        <div class=\"media\">\n            <div class=\"mr-3 img-48\"><img decoding=\"async\" class=\"wpdm_icon\" alt=\"Icon\" src=\"https:\/\/www.librarypi.com\/wp-content\/plugins\/download-manager\/assets\/file-type-icons\/txt.svg\" \/><\/div>\n            <div class=\"media-body\">\n                <h3 class=\"package-title\"><a href='https:\/\/www.librarypi.com\/index.php\/download\/sql-stage-2\/'>SQL Stage 2<\/a><\/h3>\n                
<div class=\"text-muted text-small\"><i class=\"fas fa-copy\"><\/i> 1 file(s) <i class=\"fas fa-hdd ml-3\"><\/i> 0.91 KB<\/div>\n            <\/div>\n            <div class=\"ml-3\">\n                <a class='wpdm-download-link download-on-click btn btn-primary ' rel='nofollow' href='#' data-downloadurl=\"https:\/\/www.librarypi.com\/index.php\/download\/sql-stage-2\/?wpdmdl=177&refresh=6a458b4721d8e1782942535\">Download<\/a>\n            <\/div>\n        <\/div>\n    <\/div>\n<\/div>\n\n<\/div><\/p>\n<p>&nbsp;<\/p>\n<h3>Install Tesseract<\/h3>\n<p><span style=\"color: #000000; font-family: Calibri; font-size: medium;\">Tesseract is the open source OCR package we will be using.\u00a0 It\u2019s another one-line install:<\/span><\/p>\n<blockquote><p>sudo apt-get install tesseract-ocr<\/p><\/blockquote>\n<p>To test it, download our sample file:<\/p>\n<blockquote><p>wget http:\/\/www.librarypi.com\/downloads\/Sample.jpg<\/p><\/blockquote>\n<p>Execute the following command to test tesseract:<\/p>\n<blockquote><p>tesseract Sample.jpg Sample<\/p><\/blockquote>\n<p>This may take several minutes to complete, but eventually output should appear like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-126\" src=\"http:\/\/www.librarypi.com\/wp-content\/uploads\/2016\/03\/tesseract1.png\" alt=\"tesseract1\" width=\"477\" height=\"66\" \/><\/p>\n<p>This should have output a Sample.txt file, so look at the top of that file:<\/p>\n<blockquote><p>head Sample.txt<\/p><\/blockquote>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-127\" src=\"http:\/\/www.librarypi.com\/wp-content\/uploads\/2016\/03\/tesseract2.png\" alt=\"tesseract2\" width=\"508\" height=\"198\" \/><\/p>\n<p>Now, compare that text with the actual Sample.jpg file, and we see tesseract did a decent job:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-128\" 
src=\"http:\/\/www.librarypi.com\/wp-content\/uploads\/2016\/03\/tesseract3.png\" alt=\"tesseract3\" width=\"436\" height=\"172\" \/><\/p>\n<h3>MySQL Changes<\/h3>\n<p>We need to add two new tables to our MySQL database to support the words on the pages: an lp_word table to hold unique words, and an lp_page_word table to hold a list of the words on each page.\u00a0 The tables are defined as follows:<\/p>\n<pre>create table if not exists lp_word(\r\n\u00a0id integer\u00a0 NOT NULL AUTO_INCREMENT,\r\n\u00a0word varchar(32),\r\n\u00a0PRIMARY KEY PK_lp_word (id),\r\n\u00a0INDEX ilp_word (word)\r\n);\r\ncreate table if not exists lp_page_word(\r\n\u00a0id integer\u00a0 NOT NULL AUTO_INCREMENT,\r\n\u00a0page_id integer,\r\n\u00a0word_id integer,\r\n\u00a0seq integer,\r\n\u00a0posleft integer,\r\n\u00a0postop integer,\r\n\u00a0posright integer,\r\n\u00a0posbottom integer,\r\n\u00a0PRIMARY KEY PK_lp_page_word (id),\r\n\u00a0INDEX ilp_page_word_page (page_id),\r\n\u00a0INDEX ilp_page_word_word (word_id)\r\n);<\/pre>\n<p>Using the mysql client or MySQL Workbench, execute the statements above in your database.<\/p>\n<h3>lp.php \u2013 Add OCR data to database<\/h3>\n<p>We need to modify the PHP &#8216;LP&#8217; class to add the ability to OCR text using tesseract and store the resulting words in the database for searching.\u00a0 We&#8217;re also going to add methods that help the BookReader support text searching.<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Method OCRFiles<\/strong><\/span><br \/>\nThis method searches the lp_page table for records with &#8216;N&#8217; status and takes the first one found.\u00a0 It then invokes tesseract on that file to create the hocr file, and sets the page status to &#8216;O&#8217; to indicate it has been OCR&#8217;ed.<\/p>\n<pre>\u00a0function OCRFiles( ) \r\n\u00a0{\r\n\u00a0\u00a0$Ret = 0;\r\n\u00a0\u00a0$Cmd = 'select p.id, filename '.\r\n\u00a0\u00a0\u00a0' from lp_page p 
'.\r\n\u00a0\u00a0\u00a0' where status = \\'N\\'';\r\n\u00a0\u00a0$result = mysqli_query( db(), $Cmd );\r\n\u00a0\u00a0if ( $row = mysqli_fetch_array( $result ) )\r\n\u00a0\u00a0{\r\n\u00a0\u00a0\u00a0$ID = $row[\"id\"];\r\n\u00a0\u00a0\u00a0$Filename = $row[\"filename\"];\r\n   \/\/ Create same file with .hocr extension appended (arguments quoted for the shell)\r\n\u00a0\u00a0\u00a0exec( 'tesseract '.escapeshellarg( $Filename ).' '.escapeshellarg( $Filename ).' hocr' ); \r\n   if( file_exists( $Filename.\".hocr\" ) )\r\n\u00a0\u00a0\u00a0{ \/\/ Update status to 'O' (for OCR'ed)\r\n\u00a0\u00a0\u00a0\u00a0  mysqli_query( db(), \"update lp_page set status='O' where id = $ID\" );\r\n\u00a0\u00a0\u00a0\u00a0  $Ret++;\r\n\u00a0\u00a0\u00a0}\r\n\u00a0\u00a0}\r\n\u00a0\u00a0$result-&gt;close();\r\n\u00a0\u00a0return $Ret;\u00a0\u00a0\u00a0\r\n\u00a0}<\/pre>\n<p><span style=\"text-decoration: underline;\"><strong>Method AddOCRRecords<\/strong><\/span><br \/>\nThis method will look for lp_page records with &#8216;O&#8217; status.\u00a0 It will then load the corresponding hocr file for the image, and parse it into words.\u00a0 It calls GetWordID to add the word to the database if needed, and then records its position on the page in the lp_page_word table.<\/p>\n<pre>function AddOCRRecords()\r\n {\r\n $Ret = 0;\r\n $Cmd = 'select id, filename '.\r\n ' from lp_page '.\r\n 'where status = \\'O\\'';\r\n $result = mysqli_query( db(), $Cmd);\r\n $Max = 10;\r\n while ( $row = mysqli_fetch_array( $result ) )\r\n {\r\n $Max--;\r\n if( $Max == 0 )\r\n break;\r\n $PageID = $row[\"id\"];\r\n $Filename = $row[\"filename\"].'.hocr';\r\n $Text = file_get_contents( $Filename );\r\n $Seq = 0;\r\n while( ($P = strpos( $Text, \"&lt;span class='ocrx_word'\" )) !== FALSE )\r\n {\r\n $Text = substr( $Text, $P+23 );\r\n \u00a0\u00a0\u00a0\u00a0$P = strpos( $Text, \"&lt;\/span&gt;\" );\r\n $Word = substr( $Text, 0, $P );\r\n $P = strpos( $Word, 'bbox'\u00a0 );\r\n $Word = substr( $Word, $P+4 );\r\n $P = strpos( $Word, ';' );\r\n $Dim = substr( $Word, 0, $P\u00a0 
);\r\n $P = strpos( $Word, '&gt;' );\r\n $Word = strip_tags( substr( $Word, $P+1 ) );\r\n $WordID = $this-&gt;GetWordID( $Word );\r\n if( $WordID &gt; 0 )\r\n {\r\n $Seq++;\r\n $ar = explode( ' ', $Dim );\r\n $Cmd = \"insert into lp_page_word ( page_id, word_id, seq, posleft, postop, posright, posbottom ) values (\".\r\n \"$PageID, $WordID, $Seq, \".$ar[1].\",\".$ar[2].\",\".$ar[3].\",\".$ar[4].\")\"; \/\/ LTRB\r\n mysqli_query( db(), $Cmd );\r\n }\r\n }\r\n $Cmd = \"update lp_page set status='I' where id = $PageID\";\r\n mysqli_query( db(), $Cmd );\r\n $Ret++;\r\n }\r\n $result-&gt;close();\r\n return $Ret;\r\n }<\/pre>\n<p><span style=\"text-decoration: underline;\"><strong>Method GetWordID<\/strong><\/span><br \/>\nThis method is called to get the ID for a word.\u00a0 If the word is not found, it is automatically added and the new ID is returned.<\/p>\n<pre>\u00a0function GetWordID( $Word )\r\n\u00a0{\r\n\u00a0\u00a0$Word =\u00a0 $this-&gt;ProcessWord( $Word );\r\n\u00a0\u00a0if( strlen( $Word ) == 0 )\r\n\u00a0\u00a0\u00a0return 0;\r\n\u00a0\u00a0$result = mysqli_query( db(), \"select * from lp_word w where w.word = '$Word' \");\r\n\u00a0\u00a0if ( $row = mysqli_fetch_array( $result ) )\r\n\u00a0\u00a0\u00a0$Ret = $row[\"id\"];\r\n\u00a0\u00a0else\r\n\u00a0\u00a0{\r\n\u00a0\u00a0\u00a0mysqli_query( db(), \"insert into lp_word (word) values( '$Word') \");\r\n\u00a0\u00a0\u00a0$Ret\u00a0 = mysqli_insert_id ( db() );\r\n\u00a0\u00a0}\r\n\u00a0\u00a0$result-&gt;close();\r\n\u00a0\u00a0return $Ret;\r\n\u00a0}<\/pre>\n<p><strong><span style=\"text-decoration: underline;\">Method OutputSearchAndExit<\/span><br \/>\n<\/strong>This method will perform a search on the lp_page_word and related tables to locate words in pages.\u00a0 It is used by the util.php file, which in turn is used by the BookReader component.\u00a0 Together they allow the user to perform a text search on the uploaded books.<\/p>\n<p>This method is not listed here, to keep the\u00a0description simple. 
Leave a comment if you want more details.<\/p>\n<h3>New proc.php file<\/h3>\n<p>We are going to add a new file, proc.php, to our site, and set up a scheduled task to invoke this page with wget every minute.\u00a0 The job of this script will be to look for pages that need to be OCR&#8217;ed or indexed.\u00a0 Please note that this page runs for at least 50 seconds, and quite possibly longer.\u00a0 If you call it up in your web browser, you must be VERY patient, and even then it won&#8217;t output anything.\u00a0 It&#8217;s meant to be called by the scheduler, not normally in a browser.<\/p>\n<pre>&lt;?php\r\n\u00a0include_once( 'lp.php' );\r\n\u00a0\r\n\u00a0function DoProcess() {\r\n\u00a0\u00a0$lp = new LP();\r\n\u00a0\u00a0$lp-&gt;OCRFiles();\r\n\u00a0\u00a0$lp-&gt;AddOCRRecords();\r\n\u00a0} \/\/ end of function DoProcess() \r\n\u00a0\r\n\u00a0\u00a0\u00a0 $fp = fopen( \"uploads\/file.flag\",\"w\");\r\n\u00a0\u00a0\u00a0 if (flock($fp, LOCK_EX)) {\r\n\u00a0\u00a0try {\r\n\u00a0\u00a0\u00a0$StopAt = time() + 50; \/\/ Run for 50 seconds\r\n\u00a0\u00a0\u00a0$Cnt = 0;\r\n\u00a0\u00a0\u00a0while( $StopAt &gt; time() )\r\n\u00a0\u00a0\u00a0{\r\n\u00a0\u00a0\u00a0\u00a0DoProcess();\r\n\u00a0\u00a0\u00a0\u00a0sleep( 2 );\r\n\u00a0\u00a0\u00a0\u00a0$Cnt++;\r\n\u00a0\u00a0\u00a0}\r\n\u00a0\u00a0} \r\n\u00a0\u00a0catch( Exception $e ) \r\n\u00a0\u00a0{\r\n\u00a0\u00a0\u00a0echo \"Exception: \" . 
$e-&gt;getMessage();\r\n\u00a0\u00a0}\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 flock($fp, LOCK_UN);\r\n\u00a0\u00a0\u00a0 } \/\/ No else needed\r\n\u00a0fclose($fp);\r\n?&gt;<\/pre>\n<p><strong>Starting\u00a0scheduled job<\/strong><br \/>\nIn order to invoke proc.php every minute, we need to set up cron to call it.\u00a0 We do it by running crontab -e to edit the cron table:<\/p>\n<blockquote><p>crontab -e<\/p><\/blockquote>\n<p>Then, in the file, add this line so that wget is called to launch the proc.php file:<\/p>\n<blockquote><p>* * * * * wget -q http:\/\/localhost\/proc.php -O \/dev\/null<\/p><\/blockquote>\n<p>Once the schedule has started, you should see the &#8216;Pages to process&#8217; indicator slowly dropping:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-135\" src=\"http:\/\/www.librarypi.com\/wp-content\/uploads\/2016\/03\/proc.png\" alt=\"proc\" width=\"309\" height=\"108\" \/><\/p>\n<p>Note that this is still a Raspberry Pi, and it may take a minute or two for each page to process.\u00a0 So be patient.<\/p>\n<h3>Testing Stage 2<\/h3>\n<p>Once all the files are in place, the crontab has been modified so that proc.php is called, and the Pi has had enough time to OCR and index a book, you should be able to search for text in the book (BookReader):<\/p>\n<p>Enter text, such as Lerner, in the search field and click the Go button:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-139\" src=\"http:\/\/www.librarypi.com\/wp-content\/uploads\/2016\/03\/search1.png\" alt=\"search1\" width=\"350\" height=\"108\" \/><\/p>\n<p>You should see an animation indicating the pages that contain the text:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-140\" src=\"http:\/\/www.librarypi.com\/wp-content\/uploads\/2016\/03\/search2.png\" alt=\"search2\" width=\"131\" height=\"85\" 
\/><\/p>\n<p>When you go to that page, you should see the search text highlighted in blue:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-138\" src=\"http:\/\/www.librarypi.com\/wp-content\/uploads\/2016\/03\/search3.png\" alt=\"search3\" width=\"489\" height=\"347\" \/><\/p>\n<p>In <a href=\"http:\/\/www.librarypi.com\/index.php\/lp-stage-3\/\">Stage 3<\/a>, we will add user authentication and some security, as well as the ability to delete books.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this stage, we will be adding OCR processing of uploaded pages, and allowing the BookReader to search the pages of the current book for text.\u00a0 The tesseract program will be used for the OCR process. You can download a&#8230;<br \/><a class=\"read-more-button\" href=\"https:\/\/www.librarypi.com\/index.php\/lp-stage-2\/\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":3,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-23","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/pages\/23","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/comments?post=23"}],"version-history":[{"count":22,"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/pages\/23\/revisions"}],"predecessor-version":[{"id":181,"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/pages\/23\/revisions\/181"}],"wp:attachment":[{"href":"https:\/\/www.librarypi.com\/index.php\/wp-json\/wp\/v2\/media?par
ent=23"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}