Text information extraction from images of modified text
Abstract
Incoming article date: 17.06.2023
This article describes the development of a module that extracts text from images of modified text. Such images can be used to bypass existing information security software and leak sensitive information out of a company. The module is written in Python, with additional libraries extending the base functionality. After the module was built, an additional module was created that lets users generate modified text themselves. It uses a special dictionary that can replace any letter with an alternative, producing more modified texts with which to test the extraction module and find its weak spots. DLP systems were chosen as the integration point into a company’s information infrastructure because of their popularity and ease of integration. To connect the DLP system and the text extraction module, a mail server sends BCC copies of mail traffic, containing text and images, to the module's local mail server; additional mechanisms extract the pictures and process them within the module, which then sends back each image together with the text extracted from it. Several rounds of testing resulted in nearly 97% accuracy. Future development includes extending the module to multi-row processing and adding new alternative symbols after they are first encountered in text, using a CNN or the standard deviation of image pixels and pixel-by-pixel comparison.
Keywords: information security, data leakage, text analysis, image analysis, modified data analysis, protection against steganography
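
The dictionary-based generation of modified text described in the abstract can be illustrated with a minimal Python sketch. The article does not publish its dictionary or code, so the table `ALTERNATIVES` and the functions `modify_text` and `normalize_text` are hypothetical names and values chosen only for illustration. The sketch works on plain strings and leaves out the image rendering and recognition steps.

```python
import random

# Hypothetical substitution table: base letter -> visually similar alternatives.
# The article's real dictionary is not given; these entries are illustrative.
ALTERNATIVES = {
    "a": ["@", "4"],
    "e": ["3"],
    "i": ["1", "!"],
    "o": ["0"],
    "s": ["$", "5"],
}

# Reverse table: alternative symbol -> base letter.
REVERSE = {alt: base for base, alts in ALTERNATIVES.items() for alt in alts}


def modify_text(text, rate=1.0, rng=None):
    """Replace each letter that has alternatives with probability `rate`."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        alts = ALTERNATIVES.get(ch.lower())
        if alts and rng.random() < rate:
            out.append(rng.choice(alts))  # pick one alternative at random
        else:
            out.append(ch)  # keep the original character
    return "".join(out)


def normalize_text(text):
    """Map known alternative symbols back to their base letters."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = modify_text("secret salary", rng=random.Random(0))
    print(sample, "->", normalize_text(sample))
```

This naive normalization also rewrites genuine digits and punctuation (a real "5" becomes "s") and does not restore letter case, so a practical module would need context to decide when a symbol is a substitution.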