
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 11 Issue: 04 | Apr 2024 www.irjet.net p-ISSN: 2395-0072

Text Extraction from Video using Deep Learning


1234 Dept. of Computer Science and Engineering, Government College of Engineering Srirangam, Tamilnadu, India

Abstract - In the modern digital landscape, video content has grown rapidly across diverse platforms, offering a wealth of information. However, extracting textual information from these videos accurately and efficiently remains a significant challenge. This paper proposes an approach to extract text from video content by employing a combination of Optical Character Recognition (OCR) algorithms and a Convolutional Recurrent Neural Network (CRNN). By leveraging the strengths of both OCR and the CRNN, our approach aims to improve the accuracy and efficiency of text extraction from video lectures, tutorials, and educational content. The extracted text serves as a valuable reference for students, enriching their learning experience. This study contributes to unlocking the untapped potential of textual information embedded within video content, thereby facilitating better access to knowledge in the digital age.

Key Words: Optical Character Recognition (OCR), Convolutional Recurrent Neural Network (CRNN), Text Extraction, Video Content, Information Retrieval

1. INTRODUCTION

In today's digital age, video content has become ubiquitous across different platforms and offers a vast store of information to users around the world. However, despite the wealth of knowledge contained in these videos, accessing and utilizing the textual information they contain remains a challenging task. Accurate and efficient text extraction from video content is key to increasing the accessibility and usability of this information, especially in educational contexts such as lectures, tutorials, and instructional videos.

Traditional methods of extracting text from videos often rely on optical character recognition (OCR) techniques, which can struggle with low-quality images, distorted text, and complex backgrounds. OCR algorithms can fail to accurately recognize text that deviates from standard fonts or styles, such as handwritten text, artistic fonts, or distorted characters. This limitation may lead to errors or incomplete text extraction. Additionally, OCR alone may not capture context or structure well, especially in scenarios where text is embedded in dynamic visuals. On the other hand, convolutional neural networks (CNNs) have shown remarkable success in image recognition tasks, including text detection and extraction. By utilizing both OCR and CNN algorithms, we can potentially improve the accuracy and efficiency of text extraction from video content: OCR is used for text detection from frames and for image segmentation, while the CNN is used for text recognition. CNNs trained on large datasets can generalize well across different fonts and writing styles, making the combined approach versatile and applicable to a wide variety of video content.

1.1 RELATED WORK

K. S. Raghunandan and Palaiahnakote Shivakumara [1] address robust text detection and recognition in multi-script-oriented images. Previous research has employed techniques including bit plane slicing, Iterative Nearest Neighbor Symmetry (INNS), Mutual Nearest Neighbor Pair (MNNP) components, character detection using fixed windows, contourlet wavelet features with an SVM classifier, and Hidden Markov Models (HMMs) for recognition.

Xu-Cheng Yin and Xuwang Yin [2] present a method for accurate text detection in natural scene images. Their approach employs a fast-pruning algorithm to extract Maximally Stable Extremal Regions (MSERs) as character candidates, followed by grouping them into text candidates using single-link clustering. Automatic learning of distance metrics and thresholds is incorporated, and posterior probabilities of text candidates are estimated using a character classifier to eliminate non-text regions. Evaluation on the ICDAR 2011 Robust Reading Competition database demonstrates an f-measure of over 76%, surpassing state-of-the-art methods, with further validation on various databases confirming its effectiveness.

Pinaki Nath Chowdhury and Palaiahnakote Shivakumara [3] present a new method for detecting text on human bodies in sports images, addressing challenges such as poor image quality and diverse camera viewpoints. Unlike conventional methods, it employs an end-to-end episodic learning approach that detects clothing regions using a Residual Network (ResNet) and a Pyramidal Pooling Module (PPM) for spatial attention mapping. Text detection is performed using the Progressive Scalable Expansion Algorithm (PSE). Evaluation on various datasets demonstrates superior precision and F1-score compared to existing methods, confirming effectiveness across different inputs.

2. METHODOLOGIES

Our proposed approach for extracting text from video content leverages the strengths of optical character recognition (OCR) for detecting text regions and a Convolutional Recurrent Neural Network (CRNN) for text recognition, ensuring robust and accurate text extraction.

2.1 Preprocessing:

Preprocessing is the first crucial step in our proposed video text extraction approach. It involves cleaning and preparing the video data to make it more suitable for both OCR (text detection) and the CRNN (text recognition) to perform their tasks effectively. Here is a breakdown of what happens during preprocessing:

A. Video segmentation:

The video is first divided into individual frames. Each frame represents a single still image captured at a specific point in time during video recording. This allows us to analyze the text content of each frame independently, as shown in Fig. 1 below.
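A minimal OpenCV sketch of this segmentation step is shown below. The paper does not state a sampling rate, so the `every_n_frames` parameter is an illustrative assumption rather than a value from the study.

```python
import cv2

def extract_frames(video_path, every_n_frames=30):
    """Split a video into individual frames, keeping every `every_n_frames`-th frame."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        if index % every_n_frames == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```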

B. Grayscale conversion:

Since the color information in the video is not critical for text extraction, converting images to grayscale can be beneficial. Grayscale images contain only luminance (shades of gray) information, which can simplify the image and make it easier for OCR and the CRNN to focus on text elements. Grayscale images and videos are often used in applications such as image processing, computer vision, and medical imaging, where color information may not be necessary or may even be a distraction. This is shown in Fig. 2.

C. Noise reduction:

Videos can be prone to noise from various sources, such as camera imperfections or compression artifacts. Noise reduction techniques aim to remove these unwanted elements from frames. This helps improve overall image quality and clarity, resulting in more accurate text extraction. This is shown in Fig. 3 and Fig. 4 below.

D. Contrast Enhancement:

The goal of this technique is to improve the contrast between the text and the background in each image. By increasing the contrast, the text becomes more prominent, making it easier for both OCR and the CRNN to recognize and identify characters. This is shown in Fig. 5 below.
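A combined sketch of steps B-D using OpenCV follows. The specific filters (non-local means denoising, CLAHE contrast enhancement) and their parameter values are assumptions, since the paper does not name the exact operations used.

```python
import cv2

def preprocess_frame(frame):
    """Apply steps B-D to one frame: grayscale, noise reduction, contrast enhancement."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)        # B. grayscale conversion
    denoised = cv2.fastNlMeansDenoising(gray, None, 10)   # C. noise reduction (filter choice assumed)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(denoised)                       # D. contrast enhancement via CLAHE (assumed)
    return enhanced
```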

2.2 Text detection using OCR:

Video text extraction uses OCR as an initial pass to identify potential regions of each frame that may contain text. The breakdown of this step is as follows:

A. Detecting areas of text:

The OCR engine uses a variety of techniques to identify areas of the frame that are likely to contain text. These techniques include the following (a combined sketch is given after the list):

i. Thresholding: Converts a grayscale image to a binary image. Pixels with values higher than a threshold are considered part of the text, while other pixels are considered background. It is shown in Fig. 6.

ii. Contiguous object detection: This technique identifies contiguous areas of pixels that can represent a single character or a line of text in a binary image. It is shown in Fig. 7.

iii. Morphological operations: These operations can be used to clean up text regions by removing noise, filling small gaps in characters, or smoothing edges. It is shown in Fig. 8.
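The sketch below combines the three detection techniques with OpenCV. The use of Otsu's threshold, the closing-kernel size, and the minimum-box filter are assumptions chosen for illustration; the paper does not specify these parameters.

```python
import cv2

def detect_text_regions(gray):
    """Return bounding boxes (x, y, w, h) of candidate text regions in a grayscale frame."""
    # i. Thresholding: Otsu's method separates text pixels from the background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # iii. Morphological closing fills small gaps so characters merge into words/lines.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    # ii. Contiguous object detection: connected contours become candidate boxes.
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 10 and h > 8:          # heuristic size filter to discard specks (assumed)
            boxes.append((x, y, w, h))
    return boxes
```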

Fig. 6. Thresholding of Text

Fig. 1. Video to frames
Fig. 2. Image containing text converted to a grayscale image
Fig. 3. Before Noise Reduction
Fig. 4. After Noise Reduction
Fig. 5. Contrast Enhancement


Fig. 7. Line Segmentation

Fig. 8. Word Segmentation

2.3 Text recognition using CRNN:

In the proposed approach for video text extraction, the CRNN takes center stage after OCR has identified potential text regions within each video frame. Here is how the CRNN works:

A. The Power of Pre-trained CRNN Model:

We utilize a pre-trained CRNN model specialized in text recognition tasks. This model undergoes training on a vast dataset consisting of 200,000 images annotated with corresponding labels. Through this training process, the CRNN gains proficiency in discerning the complex patterns and features within images that are indicative of individual characters.
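The paper does not publish the exact network configuration, so the Keras sketch below is only a plausible CRNN layout: a small CNN feature extractor whose feature map is reshaped so that image width becomes the time axis, followed by bidirectional LSTMs and a softmax over the character set plus a CTC blank. Input size, layer widths, and alphabet size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(img_width=128, img_height=32, num_classes=62):
    """CNN feature extractor followed by bidirectional LSTMs; intended for CTC training."""
    inputs = layers.Input(shape=(img_width, img_height, 1), name="image")
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    # Collapse the (reduced) height into the feature axis so width acts as the time axis.
    x = layers.Reshape(target_shape=(img_width // 4, (img_height // 4) * 64))(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    # One extra output class for the CTC "blank" symbol.
    outputs = layers.Dense(num_classes + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="crnn")
```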

B. Extracting the Text:

Each text region identified by OCR and delineated by a bounding box within the video frame is extracted. The cropped image containing the potential text serves as input for the CRNN model.

C. Feature Extraction:

The CRNN model analyzes cropped images pixel by pixel, focusing not solely on letter recognition but also on extracting contextual features crucial for understanding text. Features such as strokes, curves, and spatial relationships between pixels contribute to the differentiation between similar characters.

D. Prediction Time:

Upon receiving the cropped image, the CRNN model uses its learned knowledge to predict the character sequence from the features extracted from the image, effectively converting the visual representation of the text into its corresponding textual form.
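A sketch of this prediction step is shown below, pairing the assumed model above with a simple greedy CTC decode (collapse repeated indices, drop blanks). The `ALPHABET` string and the input size are assumptions; the real alphabet depends on the training labels.

```python
import numpy as np
import cv2

# Assumed character set; the real model's alphabet depends on the training labels.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def recognize_crop(model, crop, img_width=128, img_height=32):
    """Run one cropped text region through the CRNN and greedy-decode the CTC output."""
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (img_width, img_height)).astype("float32") / 255.0
    batch = resized.T[np.newaxis, ..., np.newaxis]     # shape (1, width, height, 1)
    probs = model.predict(batch)                        # shape (1, time_steps, classes + 1)
    best_path = np.argmax(probs[0], axis=-1)
    blank = probs.shape[-1] - 1
    chars, prev = [], -1
    for idx in best_path:
        if idx != prev and idx != blank:                # collapse repeats, drop blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)
```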

Fig. 9. Predicted Text using CRNN model

2.4 Integration and output:

Recognized text from all frames is integrated based on the frame number in the video. This can include techniques such as hidden Markov models (HMMs) to handle potentially overlapping or sequential text elements across frames. The final output is the extracted text content from the entire video, possibly including timestamps or links to images for each piece of text. It is shown in Fig. 10.
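The sketch below illustrates one simple way to assemble the per-frame output into a text document with timestamps. It uses a plain consecutive-duplicate filter instead of the HMM-based handling mentioned above, and the dictionary-based intermediate format is an assumption.

```python
def assemble_transcript(frame_texts, fps):
    """Merge per-frame recognized text into one document, dropping consecutive repeats.
    `frame_texts` maps frame number -> recognized string (assumed intermediate format)."""
    lines, previous = [], None
    for frame_no in sorted(frame_texts):
        text = frame_texts[frame_no].strip()
        if text and text != previous:
            timestamp = frame_no / fps                  # timestamp in seconds
            lines.append(f"[{timestamp:7.2f}s] {text}")
            previous = text
    return "\n".join(lines)

# Example: write the merged transcript to a text file.
with open("extracted_text.txt", "w", encoding="utf-8") as f:
    f.write(assemble_transcript({0: "Hello", 30: "Hello", 60: "World"}, fps=30))
```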

Fig. 10. Generation of Text Document

3. EXPERIMENTAL EVALUATION

3.1 Environment specifications:

This section describes the experimental configuration employed in the study. The research work was carried out on a machine with an AMD Ryzen 7 CPU, 16 GB of RAM, and an Nvidia graphics card. The models are implemented in Python and executed using deep learning frameworks such as Keras and TensorFlow.

3.2 Dataset:

The experiments are performed on the MJSynth dataset, which contains images of text together with their label values. The images in the dataset cover different kinds of text, with varying fonts, sizes, and positions.

The dataset used in the study contains a total of 150,000 text images and was pre-split into two sections: train data, containing approximately 80% of the total images, and valid data (treated as test data), containing approximately 20% of the total images.

First, all of the images are extracted from each directory, and the height and width of each image are checked. Images in the train data have a mean height of 31 pixels; almost 75% of the images have a width of 136 and a height of 31, roughly 95% of the images have a width of at most 190, and comparatively few images have a width greater than 200. Fig. 12 for the training images is shown below.
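A short sketch of how such image-size statistics can be gathered is shown below. The directory layout (`mjsynth/train`) and file extension are assumptions about how the dataset is stored locally.

```python
from pathlib import Path
import numpy as np
from PIL import Image

widths, heights = [], []
for path in Path("mjsynth/train").rglob("*.jpg"):      # directory layout is an assumption
    with Image.open(path) as img:
        w, h = img.size
        widths.append(w)
        heights.append(h)

print("mean height:", np.mean(heights))
print("mean width:", np.mean(widths))
print("75th / 95th percentile width:", np.percentile(widths, [75, 95]))
```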


3.3 Result analysis:

After analyzing the results, it is evident that the convolutional recurrent neural network achieved the highest accuracy: 95% for character prediction and 91% for word prediction.
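The paper does not state exactly how character- and word-level accuracy are computed; the sketch below shows one plausible pair of metrics (the position-wise character comparison is an assumption, not the paper's definition).

```python
def character_accuracy(pred, truth):
    """Fraction of aligned character positions that match (position-wise metric, assumed)."""
    matches = sum(p == t for p, t in zip(pred, truth))
    return matches / max(len(truth), 1)

def word_accuracy(preds, truths):
    """Fraction of predicted words that exactly match the ground truth."""
    correct = sum(p == t for p, t in zip(preds, truths))
    return correct / max(len(truths), 1)

print(character_accuracy("hallo", "hello"))                    # 0.8
print(word_accuracy(["hello", "world"], ["hello", "word"]))    # 0.5
```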

3.4 Deployment:

The text extraction from video project is deployed via a web-based platform built on Django. Users access the application through a web browser, where they can upload a video or image containing text for extraction. The Django backend handles image processing tasks using the trained model, while the frontend displays results such as the text extracted from the video or image; the user can also download the text as a .txt file. The deployment is shown in Fig. 13 - Fig. 16 below.
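The paper does not publish its Django code, so the view below is only a hypothetical sketch of how an upload-and-download endpoint could look; `run_extraction_pipeline` is a placeholder for the frame extraction, OCR, and CRNN pipeline described above.

```python
# views.py -- hypothetical sketch; not the authors' published code.
from django.http import HttpResponse
from django.shortcuts import render

def run_extraction_pipeline(video_path):
    """Placeholder for the frame extraction + OCR + CRNN pipeline described above."""
    return "extracted text goes here"

def upload_video(request):
    """Accept an uploaded video, run the extraction pipeline, and return a .txt download."""
    if request.method == "POST" and request.FILES.get("video"):
        upload = request.FILES["video"]
        with open("uploaded_video.mp4", "wb") as f:
            for chunk in upload.chunks():
                f.write(chunk)
        text = run_extraction_pipeline("uploaded_video.mp4")
        response = HttpResponse(text, content_type="text/plain")
        response["Content-Disposition"] = "attachment; filename=extracted_text.txt"
        return response
    return render(request, "upload.html")
```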

4. CONCLUSIONS

This project presents a robust system for extracting text from video content, achieving an accuracy rate of 90%. By integrating a Convolutional Recurrent Neural Network (CRNN), Optical Character Recognition (OCR), and OpenCV, the system effectively identifies and extracts text regions from video frames with high precision. Through careful preprocessing, frame extraction, text detection using OCR, and subsequent text recognition with the CRNN, the system demonstrates reliable performance in accurately identifying and converting text regions into machine-readable form. The use of OpenCV for video processing and visualization further enhances the system's capabilities, enabling the overlaying of extracted text onto video frames for visual feedback. Additionally, the system's architecture allows the extracted text to be stored in text files, facilitating downstream applications such as video indexing, content analysis, and information retrieval. With a 90% accuracy rate, this project demonstrates a significant advancement in the fields of computer vision and natural language processing. The high accuracy achieved underscores the system's effectiveness in automating text extraction from multimedia content, thereby addressing the growing need for efficient and accurate text analysis in various domains.

Fig. 11. Dataset Sample
Fig. 13. Home Page
Fig. 14. Video Upload
Fig. 15. Text Region Crop
Fig. 16. Text Stored in File


REFERENCES

[1] K. S. Raghunandan, P. Shivakumara, et al., "Multi-Script-Oriented Text Detection and Recognition in Video/Scene/Born-Digital Images," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 4, pp. 1145-1162, 2018.

[2] Xuwang Yin, Kaizhu Huang, Hong-Wei Hao, and Xu-Cheng Yin, "Robust Text Detection in Natural Scene Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, May 2014.

[3] Pinaki Nath Chowdhury, Palaiahnakote Shivakumara, and Ramachandra Raghavendra, "An Episodic Learning Network for Text Detection on Human Bodies in Sports Images," IEEE Transactions on Circuits and Systems for Video Technology, 2021.
