Posts tagged language

The S key, German special characters and a Linux keyboard

Occasionally in recent weeks, this blog has provided information on keyboard shortcuts for unusual characters (unusual for English that is. Ed.) on a Linux keyboard.

The last of these took the umlaut (diaeresis) as its subject (posts passim).

German road sign for Schloßstraße in Erfurt


Today, attention turns once again to German and the s key, which can produce two characters, depending upon the combination of keystrokes.

Depressing the AltGr key and s produces “ß”, the German sharp s or Eszett, usually transcribed in English as ss.

The other character is “§”, which is produced with the AltGr, Shift and s keys.

This sign is believed to originate from the Latin signum sectionis, meaning section sign, and usually turns up in references to legal documents.

Where more than one section of a legal text is involved, the sign is repeated, i.e. §§.
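For anyone wanting to check which characters those key combinations actually produce, here is a minimal Python sketch. It relies only on the standard library; the helper name describe is made up for illustration and is not part of any keyboard tool.

```python
import unicodedata


def describe(char):
    """Return the character with its Unicode code point and official name."""
    return f"{char}  U+{ord(char):04X}  {unicodedata.name(char)}"


# The two characters typed with AltGr+s and AltGr+Shift+s.
for char in ("ß", "§"):
    print(describe(char))

# Upper-casing in Python follows the usual "ss" transcription of the sharp s.
street = "Schloßstraße"
print(street.upper())
```

Pasting a character typed with the shortcut into describe() is a quick way of confirming that the keyboard layout is producing what you expect.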

Theresa May’s latest Brexit statement

Read the subtitles for a more honest version of what the worst Prime Minister in modern British history actually said in Sharm El Sheikh.

Welsh traffic news

After Wales’ decisive 21-13 victory in yesterday’s Six Nations rugby fixture in Cardiff, Traffig Cymru, the Welsh Government’s traffic information service, couldn’t resist having a bit of fun on Twitter at the expense of the England squad and English rugby fans.

Tweet reads: “Our control room have received a report of a broken down chariot heading away from Cardiff on the #M4 and traffic officers have been despatched to find it #findthechariot”

Swing low…

I wonder if the chariot had been rescued by a band of angels before Mr Plod had a chance to find it… 😀

Gloucestershire Live reveals truth about The Independent Group

Gloucestershire Live is a sister title of the Bristol Post/Bristol Live and as such provides a similar mediocre quality of journalism to its victims, or rather readers.

Yesterday, it shook off that veil of mediocrity – albeit briefly – as its website published an item confirming what many believed concerning the main politics news story of the week: the exit of right-wing MPs from the Labour Party to form a breakaway group, as shown in the screenshot below.

Header for piece about Chuka Umunna reads Conservatives

My Gloucestershire friends have this morning confirmed via social media that as far as the governance of the county is concerned, politics inevitably equals the Conservatives and the Blue Team dominate what is effectively a de facto one-party state.

Hat tip: Westengland.

Meet Victor

In Ireland, any predominantly Irish-speaking area is known as a Gaeltacht (plural: Gaeltachtaí). The island’s Gaeltachtaí are shown in green on the map below.

Map of Irish-speaking areas of Ireland

The green-shaded area beneath the Dingle Peninsula is the Iveragh Peninsula (Irish: Uíbh Ráthach) in County Kerry and an interesting appointment has just been made here.

Yesterday Irish broadcaster RTE reported that a Russian had been appointed as an Irish language officer there and would be leading efforts to revive the language.

RTE states:

Victor Bayda, a native of Moscow, has taken up the post with Comhchoiste Ghaeltacht Uíbh Ráthaigh, a community organisation in the south Kerry Gaeltacht of Uíbh Ráthach.

Mr Bayda is a fluent Irish speaker and has been teaching it in Moscow for about fifteen years. In addition to Irish, Mr Bayda also speaks Dutch, Scots Gaelic, Welsh, Swedish, French, German and Icelandic.

His duties in his new post will include implementing a comprehensive language plan aimed at arresting the decline of the language on the peninsula, where 60% of the residents claim the ability to speak Irish.

According to the 2016 Irish census, just 7% of the Gaeltacht population speak Irish daily outside the education system.

Mr Bayda becomes the tenth Irish language planning officer to be appointed so far in Gaeltacht areas.

In 2017, Victor posted the video below on YouTube.

Focus on OCR

The way a completed translation has been produced has changed markedly over the decades since my first days as a translator for Imperial Tobacco in Bedminster, Bristol.

In those days I’d write out the translation in longhand from printed source material and take my manuscript to the typing pool where it would be transformed into typescript.

The next big change came with my learning how to touch-type. By this time I was a freelance with no more access to a typing pool.

In my early freelance days, it was rare to get editable copy that one could overkey with one’s usual word processor, spreadsheet or presentation package. The standard way of working was still from hard copy propped up in a copyholder alongside one’s keyboard.

Then there came a large surge in the use of formats such as PDF – Portable Document Format. This format enables documents, including text formatting and images, to be presented in a manner independent of application software, hardware and operating systems.

If the PDF was text-based, one could simply export the text as plain ASCII text or copy and paste it into a word processor.

However, if I had an image-based PDF to work with, my usual answer was to print it out as hard copy to be propped up in a copyholder alongside my keyboard. This was very expensive in terms of paper and other consumables for the printer, even with a machine as parsimonious as my trusty mono laser printer, whose cartridge was good for printing 3,000 or so pages of copy.

In addition to the expense of printing, there was a far greater drawback to bear in mind, i.e. one could easily miss a sentence or paragraph from the original text when keying in the translation from a hard copy original, with the consequent implications for the quality of the finished work and the client’s satisfaction with it.

Then I discovered OCR – Optical Character Recognition – the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text.

Here’s a short video explaining the basics of OCR.

My preferred OCR package is gImageReader which, like all the software I recommend for use by translators, is open source and available for both Linux and Windows.


gImageReader in action on Italian language PDF

gImageReader provides a simple graphical front-end to the tesseract OCR engine. The features of gImageReader include:

  • Importing PDF documents and images from disk, scanning devices, clipboard and screenshots;
  • Processing multiple images and documents in one go;
  • Manual or automatic recognition area definition;
  • Recognising to plain text or to hOCR documents;
  • Recognised text displayed directly next to the image;
  • Post-processing of the recognised text, including spellchecking;
  • Generating PDF documents from hOCR documents.

I generally just stick to scanning the input file to plain text, which can then be fed into a regular office suite for translation. If your office suite can handle HTML, the hOCR output is another option, as that is the format gImageReader uses for it.

The tesseract OCR engine mentioned above can also be enhanced with language packs for post-recognition spellchecking, as mentioned in the features above. At present, tesseract can recognise over 100 different languages.

In addition to GUI-based OCR, there are also Linux packages available which can perform OCR via the command line interface.

My tool of choice here is OCRmyPDF.


ocrmypdf being used in KDE’s Konsole terminal to add OCR layer to Spanish language PDF

OCRmyPDF is a package written in Python 3 that adds OCR layers to PDFs and, like gImageReader, also uses the tesseract OCR engine.

Using OCRmyPDF on the command line is simplicity itself (as shown in the screenshot above):

ocrmypdf -l [language option] inputfile.pdf outputfile.pdf

More complicated command options are possible, but after using that simple string above, you’ll be able to extract the text from your formerly image-based PDF ready for translation.
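If you have a whole folder of PDFs to process, the same command can be driven from a short script. What follows is only a sketch under a few assumptions: the function names are hypothetical, the file names are the placeholders used above, and “spa” is tesseract’s code for Spanish. It uses nothing beyond the -l option already shown.

```python
import shutil
import subprocess


def build_ocrmypdf_command(language, input_pdf, output_pdf):
    """Build the argument list for: ocrmypdf -l LANGUAGE INPUT OUTPUT."""
    return ["ocrmypdf", "-l", language, input_pdf, output_pdf]


def add_ocr_layer(language, input_pdf, output_pdf):
    """Run ocrmypdf, failing early if it is not installed."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf is not installed or not on the PATH")
    # check=True raises CalledProcessError if ocrmypdf reports a failure.
    subprocess.run(
        build_ocrmypdf_command(language, input_pdf, output_pdf), check=True
    )


if __name__ == "__main__":
    # Print the command for a Spanish-language PDF without running it.
    print(" ".join(build_ocrmypdf_command("spa", "inputfile.pdf", "outputfile.pdf")))
```

Keeping the command as a list rather than a single string means file names containing spaces need no extra quoting.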

By way of conclusion: depending on the package, OCR software can also extract text from images such as .jpg files.

An algorithm with a sense of humour?

Google’s developers evidently have a sense of humour, as the search below shows.

Google's response to search string anagram reads did you mean nag a ram

Humour from techies is not always quite so obvious to ordinary mortals; it is normally deeply buried in comments in code, mark-up and the like.

Tip of the hat: Kevin Mills

Mold: no interpreter; justice delayed

Although it’s not hitting the national headlines any more, the Ministry of Justice’s disastrous decision to outsource interpreting services for courts and tribunals in England and Wales continues to delay the administration of justice.

The latest instance your correspondent has discovered occurred last Friday at Mold, according to the County Times.


Mold Law Courts. Image courtesy of Wikimedia Commons.

That day a case against two Vietnamese defendants, Quan Vu and Bang Vu, had to be adjourned as no Vietnamese interpreter had been arranged to attend court for their plea hearing.

Both defendants are charged with being concerned in the production of cannabis in Newtown in Powys.

Judge Niclas Parry adjourned the case. A plea hearing will now take place on Friday, 8th March, with the trial date set provisionally for 15th April.

Judge Parry remarked that a letter of explanation was required as to why no interpreter had been arranged.

In the meantime both defendants remain in custody.

Easy umlauts on a Linux keyboard

Some weeks ago, I blogged about the keyboard shortcut for guillemets – French quotation marks – on a Linux keyboard (posts passim).

My attention in this post is on the German umlaut, also known as a diaeresis (or in French as a tréma. Ed.): the two dots placed over a vowel to modify its pronunciation.

Once again, one could always use the character map to insert a specific vowel with an umlaut.


KCharselect with an upper case A umlaut selected

However, the keyboard shortcut is much quicker.

To produce the letter a with an umlaut, “ä”, follow these steps.

Depress the AltGr key together with the left-hand square bracket “[”, then type “a”.

The AltGr and left-hand bracket symbol plus the vowel of your choice will give you that character plus an umlaut.

For the upper case version, I find the easiest way to avoid knotting your fingers is to turn on the CapsLock key before the AltGr key and the left-hand square bracket “[” plus vowel sequence.
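The shortcut works much like combining two characters into one. As a loose analogy only (this is not a description of how the keyboard driver itself does it), Unicode lets you attach a combining diaeresis to a vowel and normalise the pair into the single accented letter. A minimal Python sketch, in which the helper name add_umlaut is my own invention:

```python
import unicodedata

# U+0308 is the Unicode COMBINING DIAERESIS, i.e. the two dots on their own.
COMBINING_DIAERESIS = "\u0308"


def add_umlaut(vowel):
    """Return the vowel with an umlaut as a single precomposed character."""
    # NFC normalisation merges the vowel and the combining mark into one letter.
    return unicodedata.normalize("NFC", vowel + COMBINING_DIAERESIS)


print("".join(add_umlaut(v) for v in "aouAOU"))  # prints äöüÄÖÜ
```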

Stupid boy! Another pair of lookalikes

It’s no secret that Gavin Williamson MP, the current Secretary of State for Defence, is nicknamed Private Pike, after Frank Pike, the fictional Home Guard private and junior bank clerk in the BBC television comedy Dad’s Army, who was frequently referred to by platoon commander Captain Mainwaring as “stupid boy”.

Composite of Private Pike and Gavin Williamson

Young Gavin, who is the Member of Parliament for South Staffordshire, had a real stupid boy moment last week.

On Monday, in a gung-ho speech to the Royal United Services Institute, Williamson confirmed that the first of Britain’s next-generation aircraft carriers, the Queen Elizabeth, will tour the Pacific as part of its maiden voyage and that the vessel is likely to tour the South China Sea at a time of growing tensions regarding China’s territorial ambitions.

Needless to say this has not gone down well in Beijing, resulting in a planned trade visit by Chancellor Philip Hammond being cancelled.

Even former Chancellor George Osborne has commented, also alluding to Williamson as a stupid boy, but using rather more words, as iNews reports:

You have got the defence secretary engaging in gunboat diplomacy of a quite old-fashioned kind at the same time as the chancellor of the exchequer and the foreign secretary are going around saying they want a close economic partnership with China.
