Buya dissertation: Optical Character Recognition (OCR)

Optical Character Recognition (OCR)

INTRODUCTION

1.1. Optical Character Recognition

Optical Character Recognition (OCR) is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner or tablet) into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. An OCR system enables you to take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor.

All OCR systems include an optical scanner for reading text, and sophisticated software for analyzing images. Most OCR systems use a combination of hardware (specialized circuit boards) and software to recognize characters, although some inexpensive systems do it entirely through software. Advanced Roman OCR systems can read text in a large variety of fonts, but they still have difficulty with handwritten text.

1.2. History Of Optical Character Recognition

To understand the phenomena described in the above section, we have to look at the history of OCR [3, 4, 6], its development, recognition methods, computer technologies, and the differences between humans and machines [1, 2, 5, 7, 8]. It is always fascinating to be able to find ways of enabling a computer to mimic human functions, like the ability to read, to write, to see things, and so on. OCR research and development can be traced back to the early 1950s, when scientists tried to capture the images of characters and texts, first by mechanical and optical means of rotating disks and photomultipliers, then by flying spot scanners with a cathode ray tube lens, followed by photocells and arrays of them. At first, the scanning operation was slow and only one line of characters could be digitized at a time by moving the scanner or the paper medium.
Subsequently, the inventions of drum and flatbed scanners arrived, which extended scanning to the full page. Then, advances in digital integrated circuits brought photo arrays with higher density, faster transports for documents, and higher speed in scanning and digital conversions. These important improvements greatly accelerated the speed of character recognition, reduced the cost, and opened up the possibilities of processing a great variety of forms and documents. Throughout the 1960s and 1970s, new OCR applications sprang up in retail businesses, banks, hospitals, post offices; insurance, railroad, and aircraft companies; newspaper publishers, and many other industries [3, 4].

In parallel with these advances in hardware development, intensive research on character recognition was taking place in the research laboratories of both academic and industrial sectors [6, 7]. Because both recognition techniques and computers were not that powerful in the early days (1960s), OCR machines tended to make lots of errors when the print quality was poor, caused either by wide variations in type fonts and roughness of the surface of the paper or by the cotton ribbons of the typewriters [5]. To make OCR work efficiently and economically, there was a big push from OCR manufacturers and suppliers toward the standardization of print fonts, paper, and ink qualities for OCR applications. New fonts such as OCR-A and OCR-B were designed in the 1970s by the American National Standards Institute (ANSI) and the European Computer Manufacturers Association (ECMA), respectively. These special fonts were quickly adopted by the International Standards Organization (ISO) to facilitate the recognition process [3, 4, 6, 7]. As a result, very high recognition rates became achievable at high speed and at reasonable costs. Such accomplishments also brought better printing qualities of data and paper for practical applications.
Actually, they completely revolutionized the data input industry [6] and eliminated the jobs of thousands of keypunch operators who were doing the really mundane work of keying data into the computer.

1.3. Common Steps Of OCR Processing

The process of converting documents into electronic form, usually referred to as digitization, is undertaken in different steps. The process of scanning a document and representing the scanned image for further processing is called the pre-processing or imaging stage. The process of manipulating the scanned image of a document to produce searchable text is called the OCR processing stage.

1.3.1. The Imaging Stage

The imaging process involves scanning the document and storing it as an image. The most popular image format used for this purpose is the Tagged Image File Format (TIFF). The resolution (number of dots per inch, dpi) determines the accuracy rate of the OCR process.

1.3.2. The OCR Process

The major steps of the OCR processing stage are described below.

1.3.3. Distinguishing Between Text And Images (Segmentation)

In this step, the text and image blocks of the scanned image are recognized. The boundaries of each image are analyzed in order to identify the text.

1.3.4. Character Recognition (Feature Extraction)

This step involves recognizing a character using a process known as feature extraction. OCR tools store rules about the characters of a given script using a method known as the learning process. A character is then identified by analyzing its shape and comparing its features against the set of rules stored in the OCR engine that distinguishes each character.

1.3.5. Recognition Of Character

Following the character recognition process, the recognized text is checked by comparing the string of characters against an existing dictionary of words. Additional processes such as spell-checking are performed under this step.

1.3.6. Output Data Formatting

The final step involves storing the output in one of the industry standard formats such as RTF, PDF, Word and plain Unicode text.

1.4. Pattern Recognition

Pattern recognition (also known as classification or pattern classification) is a field within the area of artificial intelligence and can be defined as the act of taking in raw data and taking an action based on the category of the data. It uses methods from statistics, machine learning and other areas.

Typical applications of pattern recognition are:
Automatic speech recognition.
Classification of text into several categories (e.g. spam/non-spam email messages).
The automatic recognition of handwritten postal codes on postal envelopes.
The automatic recognition of images of human faces, etc.

The last two examples form the subtopic image analysis of pattern recognition, which deals with digital images as input to pattern recognition systems.

Some popular techniques for pattern recognition include:
Neural Networks (NN)
Hidden Markov Models (HMM)
Bayesian Networks (BN)

The application domains of pattern recognition include:
Computer Vision
Machine Vision
Medical Image Analysis
Optical Character Recognition
Credit Scoring

1.5. Applications Of Pattern Recognition

Pattern recognition has many practical applications.
Some of them are outlined below.

As a telecommunication aid for the deaf, in airline reservation, in postal departments for postal address reading (both handwritten and printed postal codes/addresses) and for medical diagnosis.
In customer billing as in telephone exchange billing systems, order data logging, automatic fingerprint identification, and as an automatic inspection system.
In automated cartography, metallurgical industries, computer-assisted forensic linguistic systems, electronic mail, information units and libraries, and for facsimile.
For direct processing of documents: as a multipurpose document reader for large-scale data processing, as a microfilm reader data input system, for high-speed data entry, for changing text/graphics into a computer-readable form, and as an electronic page reader to handle large volumes of mail.

1.6. Scope Of This Project

The project is designed to classify and identify a scanned image containing Arabic characters using a two-step approach. In the first step the Arabic text image is preprocessed, and in the second step its features are extracted. Throughout the work it is assumed that there is no noise in the image and that the image is perfectly scanned with no deviation from its original angle (no skewing).

1.7. Objectives And Applications Of This Work

Arabic Optical Character Recognition can open a new way of realizing the dream of a natural mode of communication between man and machine in this part of the world. It will expand already available knowledge to new horizons. Centuries-old rare handwritten manuscripts in Arabic, Urdu and Persian will become available to the common man.

The ultimate goal of character recognition is to simulate the human reading capabilities.
Character recognition systems can contribute immensely to the advancement of the automation process and can improve the interaction between man and machine in many applications, including office automation, check verification and a large variety of banking, business and data entry applications, library archives, document identification, e-book production, invoice and shipping receipt processing, subscription collections, questionnaire processing, exam paper processing and many other applications [9], besides online address and mark training material.

1.8. Thesis Organization

The rest of this thesis is divided into four chapters. Chapter 2 reviews the literature. Chapter 3 describes Arabic script, its peculiarities and problems. Chapter 4 covers the development of Arabic character recognition, and Chapter 5 presents conclusions and future directions.

Chapter 2 REVIEW OF LITERATURE

2.1. Optical Character Recognition

Since the beginning of writing as a form of communication, paper has prevailed as the medium for writing. Electronic media are replacing paper with time. Because they save space and are fast to access, electronic media are continuously gaining popularity. The convenience of paper, its widespread use for communication and archiving, and the amount of information already on paper press for quick and accurate methods to automatically read that information and convert it into electronic form [Albadr95].

The potential application areas of automatic reading machines are numerous. One of the earliest, and most successful, applications is sorting checks in banks, as the volume of checks that circulates daily has proven to be too large for manual entry. Some other applications are detailed in the next section [Govindan90, Mantas86].

The machine simulation of human reading (i.e. optical character recognition) has been the subject of extensive research for more than five decades.
Character recognition is a pattern recognition application with the ultimate aim of simulating the human reading capabilities for both machine-printed and handwritten cursive text. The currently available systems may read faster than humans, but cannot reliably read such a wide variety of text nor consider context. One can say that a great amount of further effort is required to, at least, narrow the gap between human and machine reading capabilities. The practical importance of OCR applications, as well as the interesting nature of the OCR problem, has led to great research interest and measurable advances in this field. Now, commercial OCR systems for Latin characters are commonly available on personal computers, achieving recognition rates above 99% [McClelland91, Welch93]. Further, systems on the market can now read a variety of writing styles (e.g., hand-written, printed omni-font), and character sets including Chinese, Japanese, Korean, Cyrillic, and Arabic.

Since the 1950s, researchers have carried out extensive work and published many papers on character recognition. Most of the published work on OCR has been on Latin, Japanese or Chinese characters. This work started in the mid-1940s for Latin and the mid-1960s for Chinese and Japanese. The following are surveys and reviews on Latin character recognition. Reference may be made to [Mori92] for a historical review of OCR research and development. The survey of [Govindan90] includes surveys of other languages; [Mantas86] gives an overview of character recognition methodologies, [Impedovo91] covers commercial OCR systems, [Tian91] machine-printed OCR, and [Tappert90, Wakahara92] on-line handwriting recognition. [Suen80] has a survey on automatic recognition of hand-printed characters (viz. numerals, alphanumerics, FORTRAN, and Katakana), while [Nouboud90] produced a review of the recognition of hand-printed (non-cursive) characters and conducted important tests on a commercial system. [Bozinovic89, Simon92] surveyed off-line cursive script recognition, Jain et al. [Jain2000] reviewed statistical pattern recognition methods, and [Plamondon2000] gave a comprehensive survey of on-line and off-line handwriting recognition. Two bibliographies of the fields of OCR and document analysis appeared in [Jenkins93, Kasturi92]. [Stallings76, Mori84] produced surveys on recognition of Chinese machine- and hand-printed characters, respectively, and Liu et al. [Liu2004] addressed the state of the art of on-line recognition of Chinese characters.

2.2. General Review Of Arabic Character Recognition

Although almost one billion people world-wide, in several different languages, use Arabic characters for writing (Arabic, Persian, and Urdu are the most noted examples), Arabic character recognition has not been researched as thoroughly as Latin, Japanese, or Chinese. The first published work on Arabic character recognition may be traced back to 1975, by Nazif [Nazif75] in his master's thesis. In his thesis a system for the recognition of printed Arabic characters was developed based on extracting strokes that he called radicals (20 radicals are used) and their positions. He used correlation between the templates of the radicals and the character image. A segmentation phase was included to segment the cursive text. Years later Badi and Shimura [Badi78, Badi80] and Nouh [Nouh80] worked on printed Arabic characters and Amin [Amin80] on hand-written Arabic characters. Surveys on AOTR may be found in [Amin85a, Amin98, Shoukry89, Jambi91, Albadr95, Nabawi2000, Ahmed94].

On-line systems are restricted to recognizing hand-written text.
Some systems recognize isolated characters [Ali89, Amin80, Amin85b, Amin87, ElSheikh89, ElSheikh90b, ElWakil87, ElWakil89, Saadallah85] and hand-written mathematical formulas [ElSheikh90c, Amin91b], while others recognize cursive words [Badi78, Badi80, Badi82, Amin82a, Amin82b, Shaheen90, AlEmami90]. Since the segmentation problem in Arabic is non-trivial, the latter systems deal with a much harder problem.

While several off-line systems use video cameras to digitize pages of text (e.g., [Abbas86, Goraine92, Amin86, HajHassan85, HajHassan90, Nouh80, Nouh87, Nouh89, Sarfraz2003, Sarfraz2004]), the trend now is to use scanners with resolutions ranging from 200 to 400 dots per inch (e.g., [AbdelAzim89c, AbdelAzim90a, AlYousefi88, Amin91a, Bouhlila89, ElDabi90, ElSheikh88a, Ramsis88, Sarfraz2003a, Sarfraz2003b, Zidouri2002, Zidouri2005]). Scanners introduce less noise to an image, are less expensive, and are more convenient to use for character recognition, especially when coupled with automatic document feeders, automatic binarization, and image enhancement.

Among the off-line systems that recognize hand-written isolated characters are [Abuhaiba90, AlYousefi90, AlTikriti85, ElDesouky92, Hyder88]. [Abbas86, AbdelAzim89b, Goneid92] recognize hand-written Arabic (Hindi) numerals, and [Badi80, Badi82, Goraine92, Jambi92, Zahour91] recognize hand-written words. The majority of off-line systems recognize typewritten cursive words [AbdelAzim89c, AbdelAzim90a, Bouhlila89, ElDabi90, Amin86, ElKhaly90, ElSheikh88b, Goraine89, Khella92, Margner92, Nazif75, Nouh87, Ramsis88, Tolba89, Tolba90, ElRamly89c, HajHassan90, HajHassan91], while [ElShiekh88a, Mahdi89, Mahmoud94, Nouh80, Nouh89, NurulUla88, Fayek92, Sarfraz2005d, Zidouri2005] recognize only typewritten isolated characters. The systems of [Abdelazim90b, AlBadr92, ElGowely90, Kurdy92, Fakir93] are intended to recognize typeset words. One of the systems [Abdelazim89a] recognizes bilingual (Arabic/Latin) typewritten words.
Examples of systems for recognition of other languages that use Arabic script are [Parhami81, Yalabik88, Hyder88], which are designed for the recognition of Persian, Ottoman (Old Turkish), and Urdu, respectively.

2.3. Applications Of Optical Character Recognition

Optical character recognition technology has many practical applications that are independent of the treated language. The following are some of these applications.

Financial Business Applications
For sorting bank checks, since the number of checks per day has been far too large for manual sorting.

Commercial Data Processing
For entering data into commercial data processing files, for example entering the names and addresses of mail order customers into a database. In addition, it can be used as a work sheet reader for payroll accounting.

In Postal Department
For postal address reading and sorting, and as a reader for handwritten and printed postal codes.

In Newspaper Industry
High-quality typescript may be read by recognition equipment into a computer typesetting system to avoid typing errors that would be introduced by re-keying the text on computer peripheral equipment.

Use By The Blind
It is used as a reading aid using photo sensors and tactile simulators, and as a sensory aid with sound output. Additionally, it can be used for reading text sheets and reproducing Braille originals.

In Facsimile Transmission
This application involves transmission of pictorial data over communication channels. In practice, the pictorial data is mainly text. Instead of transmitting characters in their pictorial representation, a character recognition system could be used to recognize each character and then transmit its text code.

Finally, it is worth saying that the major potential application for automatic character recognition is as a general data entry method for automating the work of an ordinary office typist.

2.4. Development Of New OCR Techniques

As OCR research and development advanced, demands on handwriting recognition also increased because a lot of data (such as addresses written on envelopes; amounts written on checks; names, addresses, identity numbers, and dollar values written on invoices and forms) were written by hand and had to be entered into the computer for processing. But early OCR techniques were based mostly on template matching, simple line and geometric features, stroke detection, and the extraction of their derivatives.

Such techniques were not sophisticated enough for practical recognition of data handwritten on forms or documents. To cope with this, the Standards Committees in the United States, Canada, Japan, and some countries in Europe designed some handprint models in the 1970s and 1980s for people to write them in boxes [7]. Hence, characters written in such specified shapes did not vary too much in style, and they could be recognized more easily by OCR machines, especially when the data were entered by controlled groups of people; for example, employees of the same company were asked to write their data like the advocated models. Sometimes writers were asked to follow certain additional instructions to enhance the quality of their samples, for example, write big, close the loops, use simple shapes, do not link characters, and so on. With such constraints, OCR recognition of handprints was able to flourish for a number of years.

2.5. Recent Trends And Movements

As the years of intensive research and development went by, and with the birth of several new conferences and workshops such as IWFHR (International Workshop on Frontiers in Handwriting Recognition), ICDAR (International Conference on Document Analysis and Recognition), and others [13], recognition techniques advanced rapidly. Moreover, computers became much more powerful than before.
People could write the way they normally did, characters no longer had to be written like specified models, and the subject of unconstrained handwriting recognition gained considerable momentum and grew quickly. As of now, many new algorithms and techniques in pre-processing, feature extraction, and powerful classification methods have been developed [8, 9].

Chapter 3 ARABIC: A CURSIVE SCRIPT

3.1. Arabic

Arabic is a Semitic language used as the principal language in many countries. Arabic is spoken by 234 million people [9] and is important in the culture of many more. While spoken Arabic varies across regions, written Arabic, sometimes called Modern Standard Arabic (MSA), is a standardized version used for official communication across the Arab world [9]. The characters of Arabic script and similar characters are used by a much higher percentage of the world's population to write languages such as Arabic, Farsi (Persian) and Urdu. Thus the ability to automate the interpretation of written Arabic would have widespread benefits.

Urdu is normally written in the calligraphic Nastaliq script, whereas Arabic is more commonly written in Naskh. Usually, bare transliterations into Roman letters omit many phonemic elements that have no counterpart in English or other languages commonly written in the Roman alphabet. The National Language Authority of Pakistan has developed a number of systems with specific notations to signify non-English sounds, but these can only be properly read by someone already familiar with Urdu, Persian, or Arabic for letters such as ? ? ? ? or ?, and with Hindi for other letters.
Most Arabic characters, when joined, form an angle of about 45 degrees to the horizontal line. Because of this, reading Arabic script is faster than reading Roman script, but it also makes it harder for novice readers and for machines to identify the word or to segment one character from the rest.

Unlike the English script, there are no capital or small characters in Urdu, but the last character of a word can be regarded as a capital character, as in many cases it presents the full form of the character, while the characters at the initial and middle positions are considered small. Every character has an isolated shape besides different joined forms, but some letters, like those making up the word Urdu (? ? ? ?) or of a similar category, are not joinable and cannot be connected. The Arabic alphabet uses consonant letters, vowels, diacritical marks, numerals, punctuation and a few superscript signs.

The graphical representation of each letter has more than one form, depending on its position and context in the word. In general each letter has four forms, that is, beginning, middle, final and standalone, as shown in Table 3.1.

3.2. Arabic Letters

The Arabic alphabet contains 28 letters. Each has between two and four shapes, and the choice of which shape to use depends on the position of the letter within its word or sub-word. The shapes correspond to the four positions: beginning of a (sub-)word, middle of a (sub-)word, end of a (sub-)word, and in isolation. Table 3.1 shows each shape for each letter. For letters without distinct initial shapes, the initial shape is simply the isolated shape, and the medial shape is the final shape.

Some letters have descenders or ascenders, which are portions that extend below the primary line on which the letters sit or above the height of most letters. There is no upper or lower case, but only one case.
Arabic script is written from right to left, and letters within a word are normally joined even in machine print. Letter shapes and whether or not to connect depend on the letter and its neighbors. Letters are connected at the same relative height. The baseline is the line at the height at which letters are connected, and it is analogous to the line on which an English word sits. Letters are wholly above it except for descenders and some markings. There is no connection between separate words, so word boundaries are always represented by a space. Six letters, however, can be connected only on one side. When they occur in the middle of a word, the word is divided into multiple sub-words separated by space.

A ligature is a character formed by combining two or more letters in an accepted manner. Arabic has several standard ligatures, which are exceptions to the above rules for joining letters. The most common is laam-alif, the combination of laam and alif; others include yaa-meem.

3.3. Problems Of Arabic Script

Despite a large character set, Arabic has only a small set of base shapes that are easily distinguishable from one another. The remaining characters differ from these only by dots or symbols above or below the shapes [19]. Table 3.2 shows groups of similar characters and their derived forms.

As shown in Table 3.2, only 21 distinct groups exist in the 32-character set. This complicates the recognition phase for Arabic characters. Further study of the other forms (initial, middle and final) of these characters reveals that ein ( ) is similar to hamza (?), wow (?) might be confused with (?), ze (?) resembles noon ( ), and mem (?) can be confused with the middle form of ein ( ) and with the standalone goal-he (?).

A key difference between Latin scripts and Arabic script is the fact that many letters differ only by one or more dots while the primary stroke is exactly the same [19].

3.4. Other Problems In Arabic OCR

All Muslims (a substantial fraction of the people on the earth) can read Arabic because it is the language of Al-Quran, the holy book of Muslims. Even so, Arabic script recognition has not received enough interest from researchers. Little research progress has been achieved compared to that done on Latin and Chinese. The solutions available on the market are still far from perfect [11, 14]. A few reasons led to this result:

Lack of financial support and of a platform from any government of the countries where Arabic is an official language.
Lack of adequate support in terms of journals, books, etc., and lack of interaction between researchers in this field.
Lack of general supporting utilities like Arabic text databases, dictionaries, programming tools, and supporting staff.
Late start of Arabic text recognition (first publication in 1975 compared with the 1940s in the case of Latin character recognition).
Research on the Arabic language is typically scattered and carried out outside the Arab world.
No specialized conferences or symposia have been held so far.
Algorithms developed for other language scripts are not applicable to Arabic.

3.5. Characteristics Of Arabic Characters

The calligraphic nature of the Arabic script distinguishes it from other languages in several ways. For example:

Arabic text is written from right to left.
No upper or lower cases exist in Arabic, but sometimes the last character of a word is considered as upper case because it always remains in its full form.
Arabic has 28 basic characters, of which 16 have from one to three dots. Those dots differentiate between otherwise similar characters. Additionally, three characters can have a zigzag-like stroke.
The dots are called secondaries and they are located above the character's primary part as in ALEF (?), or below as in BAA (?), or in the middle as in JEEM (?).
Written Arabic text is cursive in both machine-printed and hand-written text. Within a word, some characters connect to the preceding and/or following characters, and some do not connect. The connectivity of characters results in a word having one or more connected components. We will refer to each connected piece of a word as a sub-word.
The shape of an Arabic character depends on its position in the word; a character might have up to four different shapes depending on it being isolated, connected from the right (beginning form), connected from the left (ending form), or connected from both sides (middle form).
A distinguishing feature of Arabic writing is the presence of a baseline. The baseline is a horizontal line that runs through the connected portions of text (i.e. where the characters' connection segments are located). The baseline has the highest number of text pixels. (See figure 3.2.)
Characters in a word may overlap vertically (even without touching).
Arabic characters do not have a fixed size (height and width). The character size varies according to its position in the word.
Characters in a word can have diacritics. These diacritics are written as strokes, placed either on top of, or below, the characters. A different diacritic on a character may change the meaning of a word. Readers of Arabic are accustomed to reading un-diacritized text by deducing the meaning from context.
Several characters can combine vertically to form a ligature, especially in typeset and handwritten text.
Arabic words may consist of one or more sub-words. Each sub-word may have one or more characters, because some Arabic characters are not joinable to others from the left side. As an example, the word Ketab ( ) consists of two sub-words: Keta ( ), which consists of three characters, and BAA ( ?), which is a single character.
There are only three characters that represent vowels: ?, ? or ?. However, there are other shorter vowels represented by diacritics in the form of overscores or underscores, although overscores and underscores are rarely written in Arabic.
Dots may appear as two separated dots, touching dots, a hat, or a stroke.
Another style of Arabic handwriting is artistic or decorative calligraphy, which is usually full of overlapping, making the recognition process even more difficult, for human readers as well as for computers.

3.6. Summary

The characteristics of Arabic script include its cursive nature, its right-to-left writing direction, the change of shape when a character is placed at different positions in a word, loops, half-closed characters, and dots above or below a character. The National Language Authority defined a 32-character set, but it has only 21 distinct character shapes besides numerals and diacritics.

Chapter 4 ARABIC CHARACTER RECOGNITION

4.1. Phases Of Arabic Character Recognition

In an offline character recognition system, the user scans a document, runs the OCR and gets the document saved in a file format of his choice. The transformation of the text from the scanning phase to the final document involves a number of phases that are transparent to the user. The proposed system can be implemented in the following steps:

Image Acquisition
Digitization
Preprocessing
Feature Extraction
Recognition

Figure 4.1 shows the componen
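
As one illustration of the preprocessing and feature extraction phases listed in Section 4.1, the baseline property noted in Section 3.5 (the baseline is the row with the highest number of text pixels) can be located with a horizontal projection profile. The following is a minimal sketch, not the implementation used in this thesis. It assumes a binarized, noise-free, unskewed image (the assumptions of Section 1.6) stored as a list of rows in which 1 marks a text pixel and 0 marks background; the names horizontal_projection and estimate_baseline are hypothetical.

```python
from typing import List

# Assumed representation: a binary image as a list of rows,
# where 1 is a text (ink) pixel and 0 is background.
Image = List[List[int]]


def horizontal_projection(image: Image) -> List[int]:
    """Count the text pixels in every row of the image."""
    return [sum(row) for row in image]


def estimate_baseline(image: Image) -> int:
    """Return the index of the row holding the most text pixels.

    If several rows tie, the topmost one is returned.
    """
    if not image:
        raise ValueError("image has no rows")
    profile = horizontal_projection(image)
    return max(range(len(profile)), key=profile.__getitem__)


# Tiny hand-made example: row 2 is the connecting stroke of the word.
sample = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]
print(horizontal_projection(sample))  # [1, 2, 7, 1]
print(estimate_baseline(sample))      # 2
```

On real scans this simple maximum is sensitive to skew and noise, which is why the sketch relies on the clean-image assumption stated in Section 1.6.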


Friday, March 29, 2019
