{"id":8876,"date":"2019-02-21T11:24:12","date_gmt":"2019-02-21T11:24:12","guid":{"rendered":"http:\/\/xislblogs.xtreamlab.net\/slwoods\/?p=8876"},"modified":"2019-02-21T12:24:36","modified_gmt":"2019-02-21T12:24:36","slug":"focus-on-ocr","status":"publish","type":"post","link":"https:\/\/www.slwoods.co.uk\/?p=8876","title":{"rendered":"Focus on OCR"},"content":{"rendered":"<p>The way a completed translation has been produced has changed markedly over the decades since my first days as a translator for Imperial Tobacco in Bedminster, Bristol.<\/p>\n<p>In those days I&#8217;d write out the translation in longhand from printed source material and take my manuscript to the typing pool where it would be transformed into typescript.<\/p>\n<p>The next big change came with my learning how to touch-type. By this time I was a freelance with no more access to a typing pool.<\/p>\n<p>In my early freelance days, it was rare to get editable copy that one could overkey with one&#8217;s usual word processor, spreadsheet or presentation package. The standard way of working was still from hard copy propped up in a copyholder alongside one&#8217;s keyboard.<\/p>\n<p>Then there came a large surge in the use of formats such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/PDF\">PDF &#8211; Portable Document Format<\/a>. This format enables documents, including text formatting and images, to be presented in a manner independent of application software, hardware and operating systems.<\/p>\n<p>If the PDF was text-based, one could simply export the text as plain <a href=\"https:\/\/en.wikipedia.org\/wiki\/ASCII\">ASCII<\/a> text or copy and paste it into a word processor.<\/p>\n<p>However, if I had an image-based PDF to work with, my usual answer was to print it out as hard copy to be propped up in a copyholder alongside my keyboard. 
This was very expensive in terms of paper and other consumables for the printer, even with a machine as parsimonious as my trusty mono laser printer, whose cartridge was good for printing 3,000 or so pages of copy.<\/p>\n<p>In addition to the expense of printing, there was a far greater drawback to bear in mind: one could easily miss a sentence or paragraph of the original text when keying in the translation from a hard copy original, with the consequent implications for the quality of the finished work and the client&#8217;s satisfaction with it.<\/p>\n<p>Then I discovered OCR &#8211; Optical Character Recognition &#8211; the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text.<\/p>\n<p>Here&#8217;s a short video explaining the basics of OCR.<\/p>\n<div class=\"epyt-video-wrapper\"><iframe loading=\"lazy\"  style=\"display: block; margin: 0px auto;\"  id=\"_ytid_76143\"  width=\"600\" height=\"450\"  data-origwidth=\"600\" data-origheight=\"450\" src=\"https:\/\/www.youtube.com\/embed\/3XNziEwFvA4?enablejsapi=1&autoplay=0&cc_load_policy=0&cc_lang_pref=&iv_load_policy=1&loop=0&rel=0&fs=1&playsinline=0&autohide=2&theme=dark&color=red&controls=1&disablekb=0&\" class=\"__youtube_prefs__  no-lazyload\" title=\"What is OCR?\"  allow=\"fullscreen; accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen data-no-lazy=\"1\" data-skipgform_ajax_framebjll=\"\"><\/iframe><\/div>\n<p>My preferred OCR package is <a href=\"https:\/\/sourceforge.net\/projects\/gimagereader\/\">gImageReader<\/a>, which &#8211; as with the <a href=\"http:\/\/xislblogs.xtreamlab.net\/slwoods\/?page_id=66\">software I recommend for use by translators<\/a> &#8211; is open source and available for both Linux and Windows.<\/p>\n<figure id=\"attachment_8929\" aria-describedby=\"caption-attachment-8929\" style=\"width: 600px\"
class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/xislblogs.xtreamlab.net\/slwoods\/wp-content\/uploads\/\/sites\/23\/2019\/02\/gimagereader.png\" alt=\"gImageReader in action on Italian language PDF\" width=\"600\" height=\"285\" class=\"size-full wp-image-8929\" srcset=\"https:\/\/www.slwoods.co.uk\/wp-content\/uploads\/\/sites\/23\/2019\/02\/gimagereader.png 600w, https:\/\/www.slwoods.co.uk\/wp-content\/uploads\/\/sites\/23\/2019\/02\/gimagereader-300x143.png 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-8929\" class=\"wp-caption-text\">gImageReader in action on Italian language PDF<\/figcaption><\/figure>\n<p>gImageReader provides a simple graphical front-end to the <a href=\"https:\/\/opensource.google.com\/projects\/tesseract\">tesseract OCR engine<\/a>. The features of gImageReader include:<\/p>\n<ul>\n<li>Importing PDF documents and images from disk, scanning devices, clipboard and screenshots;<\/li>\n<li>Processing multiple images and documents in one go;<\/li>\n<li>Defining recognition areas manually or automatically;<\/li>\n<li>Recognising to plain text or to <a href=\"https:\/\/en.wikipedia.org\/wiki\/HOCR\">hOCR<\/a> documents;<\/li>\n<li>Displaying the recognised text directly next to the image;<\/li>\n<li>Post-processing of the recognised text, including spellchecking;<\/li>\n<li>Generating PDF documents from hOCR documents.<\/li>\n<\/ul>\n<p>I generally just stick to scanning the input file to plain text, which can then be fed into a regular office suite for translation. If your office suite can handle <abbr title=\"HyperText Markup Language\">HTML<\/abbr>, that&#8217;s the format gImageReader uses for its hOCR output.<\/p>\n<p>The tesseract OCR engine can also be enhanced with language packs for post-recognition spellchecking, as listed in the features above.
At present, tesseract can recognise over 100 different languages.<\/p>\n<p>In addition to GUI-based OCR, there are also Linux packages available which can perform OCR via the command line interface.<\/p>\n<p>My tool of choice here is <a href=\"https:\/\/ocrmypdf.readthedocs.io\/en\/latest\/introduction.html\">OCRmyPDF<\/a>.<\/p>\n<figure id=\"attachment_8930\" aria-describedby=\"caption-attachment-8930\" style=\"width: 600px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/xislblogs.xtreamlab.net\/slwoods\/wp-content\/uploads\/\/sites\/23\/2019\/02\/ocrmypdf.png\" alt=\"ocrmypdf in action in KDE Konsole terminal\" width=\"600\" height=\"420\" class=\"size-full wp-image-8930\" srcset=\"https:\/\/www.slwoods.co.uk\/wp-content\/uploads\/\/sites\/23\/2019\/02\/ocrmypdf.png 600w, https:\/\/www.slwoods.co.uk\/wp-content\/uploads\/\/sites\/23\/2019\/02\/ocrmypdf-300x210.png 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-8930\" class=\"wp-caption-text\">ocrmypdf being used in KDE&#8217;s Konsole terminal to add OCR layer to Spanish language PDF<\/figcaption><\/figure>\n<p>OCRmyPDF is a package written in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\">Python<\/a> 3 that adds OCR layers to PDFs and, like gImageReader, also uses the tesseract OCR engine.<\/p>\n<p>Using OCRmyPDF on the command line is simplicity itself (as shown in the screenshot above):<\/p>\n<pre>ocrmypdf -l [language option] inputfile.pdf outputfile.pdf<\/pre>\n<p>More complicated command options are possible, but after using the simple string above, you&#8217;ll be able to extract the text from your formerly image-based PDF ready for translation.<\/p>\n<p>By way of conclusion, depending on the software, OCR packages can also extract text from images such as .jpg files.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The way a completed translation has been produced has changed
markedly over the decades since my first days as a translator for Imperial Tobacco in Bedminster, Bristol. In those days I&#8217;d write out the translation in longhand from printed source material and take my manuscript to the typing pool where it would be transformed into [&hellip;]<\/p>\n","protected":false},"author":20,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13,3,5,9],"tags":[40,22,12,23],"class_list":["post-8876","post","type-post","status-publish","format-standard","hentry","category-bristol","category-translation-and-language-related-matters","category-open-source-software","category-tech","tag-bristol","tag-language","tag-open-source","tag-tech-2"],"_links":{"self":[{"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/8876","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=\/wp\/v2\/users\/20"}],"replies":[{"embeddable":true,"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8876"}],"version-history":[{"count":17,"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/8876\/revisions"}],"predecessor-version":[{"id":8943,"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/8876\/revisions\/8943"}],"wp:attachment":[{"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8876"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8876"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.slwoods.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8876"}],"curies":[{"name":"wp","hre
f":"https:\/\/api.w.org\/{rel}","templated":true}]}}