Engineering Journal of Don

Moving from a university data warehouse to a lake: models and methods of big data processing
The article examines the transition of universities from data warehouses to data lakes and the potential of lakes for processing big data. The introduction outlines the main differences between warehouses and lakes, focusing on their different philosophies of data management. Data warehouses are typically used for structured data with a relational architecture, while data lakes store data in its raw form, which supports flexibility and scalability. The section "Data Sources used by the University" describes how universities manage data collected from various departments, including ERP systems and cloud databases. The comparison of data lakes and data warehouses covers their key differences in data processing and management methods, along with their advantages and disadvantages. The article examines in detail the challenges of the transition to data lakes, including security, scale, and implementation costs. Architectural models of data lakes such as "Raw Data Lake" and "Data Lakehouse" are presented, describing different approaches to managing the data lifecycle and meeting business goals. The big data processing methods discussed for lakes include the Apache Hadoop platform and current storage formats. Processing technologies are described, including Apache Spark and machine learning tools, and practical examples are given of data processing and machine learning coordinated through Spark. In conclusion, the article emphasizes the relevance of the transition to data lakes for universities, highlights the security and management challenges, and recommends cloud technologies to reduce costs and increase productivity in data management.

Keywords: data warehouse, data lake, big data, cloud storage, unstructured data, semi-structured data
An approach to improving the quality of machine learning models in monitoring of complex systems based on metric spaces
- Bolshakov M.A.
- Khodakovsky V.A.
The paper proposes an approach, based on metric spaces, to improving the efficiency of machine learning models used in monitoring tasks. A method is proposed for assessing the quality of monitoring systems based on interval estimates of the response zones to a possible incident. This approach extends the classical metrics for evaluating machine learning models to take into account the specific requirements of monitoring tasks. The interval boundaries are calculated from probabilities produced by a classifier trained on historical data to detect dangerous states of the system. Combining the probability of an incident with the normalized distance to incidents in the training sample makes it possible to improve all of the considered monitoring quality metrics simultaneously: accuracy, completeness, and timeliness. One way to improve results is to use the scalar product of the normalized components of the metric space and their importance as features in a machine learning model. Feature importance is obtained with the permutation feature importance method, which does not depend on the chosen machine learning algorithm. Numerical experiments have shown that using distances in a metric space to incident points from the training sample can improve the early detection of dangerous situations by up to a factor of two. The proposed approach is versatile and can be applied to various classification algorithms and distance calculation methods.

Keywords: monitoring, machine learning, state classification, incident prediction, lead time, anomaly detection
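
As a rough illustration of the idea in the abstract above, the following Python sketch combines a classifier's incident probability with an importance-weighted, normalized distance to the nearest incident in the training sample. The abstract does not give the exact formula, so the combination rule (a plain average), the function names, the feature ranges, and the sample numbers are all assumptions made for illustration, not the authors' method. Feature importances are taken as given here; in the paper they come from permutation feature importance.

```python
def weighted_distance(point, incident, ranges, importances):
    """Scalar product of normalized per-feature distances and feature weights.

    Each component |a - b| / range is clipped to [0, 1]; importances are
    rescaled to sum to 1, so the result also lies in [0, 1].
    """
    total = sum(importances)
    weights = [w / total for w in importances]
    components = [min(abs(a - b) / r, 1.0) for a, b, r in zip(point, incident, ranges)]
    return sum(w * c for w, c in zip(weights, components))


def combined_risk(probability, point, incidents, ranges, importances):
    """Average the classifier probability with proximity to the nearest incident.

    The 50/50 average is an assumed combination rule, chosen only to keep
    the sketch simple.
    """
    nearest = min(weighted_distance(point, inc, ranges, importances) for inc in incidents)
    proximity = 1.0 - nearest  # 1.0 at an incident point, 0.0 when maximally far
    return 0.5 * probability + 0.5 * proximity


# Hypothetical data: two features (e.g. CPU load in %, queue length).
incidents = [(95.0, 40.0), (90.0, 55.0)]   # incident points from the training sample
ranges = (100.0, 60.0)                     # observed range of each feature
importances = (0.7, 0.3)                   # e.g. from permutation importance

risk_near = combined_risk(0.4, (93.0, 42.0), incidents, ranges, importances)
risk_far = combined_risk(0.4, (10.0, 2.0), incidents, ranges, importances)
```

With the same classifier probability, the state close to a known incident receives a higher combined score than the distant one, which is the mechanism the abstract credits for earlier detection.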
Improving the efficiency of working with databases in PHP based on the use of PDO
PHP Data Objects (PDO) represents a significant advancement in PHP application development, providing a uniform approach to interacting with database management systems (DBMSs). The article opens with an introduction describing the need for PDO, available as of PHP 5.1, which lets PHP developers work with different databases through a single interface, minimising the effort involved in portability and code maintenance. It discusses how PDO improves security by supporting prepared statements (prepared queries), which defend against SQL injection. The main part of the paper analyses the key advantages of PDO: its versatility in connecting to multiple databases (e.g. MySQL, PostgreSQL, SQLite), the use of prepared statements to enhance security, improved error handling through exceptions, transactional support for data integrity, and a PDO API that is easy to learn even for beginners. Practical examples are provided, including preparing and executing SQL queries, setting attributes via the setAttribute method, and performing operations in transactions, emphasising the flexibility and robustness of PDO. In addition, the paper discusses best practices for using PDO in complex and high-volume projects, such as using prepared statements for bulk data insertion, query optimisation, and stream processing for efficient handling of large amounts of data. The conclusion characterises PDO as the preferred tool for modern web applications, offering a combination of security, performance, and improved code quality. The authors also suggest directions for future research on security test automation and the impact of different data models on application performance.

Keywords: PHP, PDO, databases, DBMS, security, prepared queries, transactions, programming
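
The two practices the abstract above highlights, parameterized (prepared) queries and transactional bulk insertion, can be sketched as follows. This is not PDO code: it is an analogous illustration in Python using the standard sqlite3 module, and the table, column names, and sample rows are hypothetical.

```python
import sqlite3


def insert_users(conn, rows):
    """Bulk-insert rows inside one transaction using a parameterized statement."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.executemany("INSERT INTO users (name, email) VALUES (?, ?)", rows)


def find_user(conn, name):
    """Look up a user; the name is bound as data, never spliced into the SQL."""
    cur = conn.execute("SELECT name, email FROM users WHERE name = ?", (name,))
    return cur.fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, email TEXT NOT NULL UNIQUE)")
insert_users(conn, [("alice", "alice@example.org"), ("bob", "bob@example.org")])

# An injection attempt is treated as an ordinary (non-matching) string value.
injected = find_user(conn, "alice' OR '1'='1")

# A batch that violates the UNIQUE constraint is rolled back as a whole.
try:
    insert_users(conn, [("carol", "carol@example.org"), ("dave", "alice@example.org")])
except sqlite3.IntegrityError:
    pass
```

Because the second batch fails on its last row, the earlier row for "carol" is rolled back too, which is the data-integrity guarantee the abstract attributes to transactions.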
