WP1 - Identifying the requirements, techniques, and AI models published in the current literature. Identifying the limitations of current techniques, proposing improvements, and collecting databases associated with the requirements.
D1.1 This activity aimed to interpret and analyze the project requirements, based on the terms of reference and guided by periodic (bi-weekly) meetings with the beneficiary, as well as by question-and-answer documents exchanged with them. The aspects addressed in this report can be summarized as follows: (i) description of the general requirements of the project; (ii) description of the specific requirements of the project related to algorithms, the integrated software platform, and the data used within the project; (iii) the scenarios used in algorithm development; (iv) requirements related to data and application security.
D1.2 This deliverable surveys the scientific literature on the Artificial Intelligence algorithms that will be developed within the project. It contains information about the following algorithms: (i) algorithms for content description, context extraction, and entity similarity; (ii) algorithms for classifying human actions; (iii) algorithms for re-identifying individuals based on general visual appearance, facial similarity, and voice similarity; (iv) algorithms for vehicle identification; (v) algorithms for geolocation in urban environments; (vi) algorithms for detecting textual entities.
D1.3 This activity aims to study and validate the software architecture that will be implemented within the project, targeting the following points: (i) the overall software architecture, connections to data sources, and protocols; (ii) the data processing module; (iii) data storage; (iv) the intelligent reporting module; (v) the GIS map module; (vi) algorithm topologies and methods.
D1.4 This activity identifies data sources that can be used for training, validating, and testing the algorithms of this project. Where available, databases that closely match the scenarios described in Activity I.1 were analyzed and selected. Thus, databases were analyzed for the following algorithms: (i) algorithms for content description, context extraction, and entity similarity; (ii) algorithms for classifying human actions; (iii) algorithms for re-identifying individuals based on general visual appearance, facial similarity, and voice similarity; (iv) algorithms for vehicle identification; (v) algorithms for geolocation in urban environments; (vi) algorithms for detecting textual entities.
D1.5 This activity aimed to identify the GDPR and data security requirements for the proposed project. The following aspects were analyzed: (i) definition and explanation of GDPR terms, as well as the national and European legislative framework encompassing these regulations; (ii) aspects regarding the processing of personal data for the databases used in this project; (iii) the application of GDPR in the context of artificial intelligence legislation, with reference to the use of cameras, surveillance tools, and the specific algorithms.
WP2 Research, design, development, and implementation of relevant innovative AI algorithms and software modules
D2.1 This deliverable presents, in a unified and complete form, the technical specifications of all Artificial Intelligence algorithms developed within the project. It describes both the functionality of each module and the software configuration required for its execution, giving for each section precise details about the Python, CUDA, and operating system versions used during testing, as well as the Python libraries needed for optimal operation. Finally, the document establishes the general specifications for delivery to the beneficiary (with emphasis on the requirements of the next Stage): the source code is provided in GitHub repositories, and each algorithm is containerized so that it can be run as an independent web service through Docker containers and uvicorn, facilitating later integration into the beneficiary’s infrastructure.
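As an illustration of this delivery format, the following minimal sketch shows how one algorithm could be exposed as an independent web service with FastAPI and uvicorn; the endpoint name, request fields, and port are illustrative assumptions, not the project's actual interface.

```python
# Minimal sketch of one algorithm exposed as an independent web service.
# Endpoint, payload fields, and port are hypothetical examples.
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="example-algorithm-service")

class ProcessRequest(BaseModel):
    # Hypothetical input: a URI pointing to the resource to be analyzed.
    resource_uri: str

@app.post("/process")
def process(req: ProcessRequest):
    # Placeholder for the algorithm-specific inference step.
    return {"resource_uri": req.resource_uri, "status": "processed"}

if __name__ == "__main__":
    # Inside the Docker container the service would be started the same way.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```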
D2.2 This deliverable describes the specifications of the databases used in the training of all algorithms developed in the project, highlighting both the datasets taken from the literature and the newly built (or modified) databases, where the project requirements could not be satisfied by existing resources.
D2.3 This report describes, at a conceptual level, the set of algorithms developed for detecting and classifying entities in images, integrated into a system that combines classical visual recognition with modern transformer-based models. For object detection, the DETR architecture was used, while image context classification is performed by a Vision Transformer (ViT) model trained to distinguish between the 20 relevant classes defined in the project. Additionally, the project integrates a large language model (LLM) used in zero-shot mode, which automatically generates a short description of each image and enumerates the visible human activities, without requiring additional training.
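As a rough illustration of this two-model setup, the sketch below runs DETR object detection and a ViT classifier on the same image using the Hugging Face transformers library; the public checkpoints named here are generic stand-ins for the project's fine-tuned 20-class model, and the zero-shot LLM step is not reproduced.

```python
# Sketch: DETR detection plus ViT context classification on one image.
# Public checkpoints are stand-ins for the project's fine-tuned models.
import torch
from PIL import Image
from transformers import (DetrImageProcessor, DetrForObjectDetection,
                          ViTImageProcessor, ViTForImageClassification)

image = Image.open("scene.jpg")

# Object detection with DETR.
det_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detector = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
det_inputs = det_processor(images=image, return_tensors="pt")
det_outputs = detector(**det_inputs)
detections = det_processor.post_process_object_detection(
    det_outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.8)[0]

# Image context classification with a ViT model; the project's own 20-class
# head would replace the generic ImageNet checkpoint used here.
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
classifier = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
logits = classifier(**vit_processor(images=image, return_tensors="pt")).logits
context_class = int(logits.argmax(-1))
```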
D2.4 The method proposed in this deliverable for recognizing human actions is built as a multi-stage pipeline that combines person detection, tracking over time, and action classification based on a transformer-type video model. In the first stage, persons are identified in each frame of the video sequence and tracked from one frame to the next in order to generate coherent trajectories, a process that ensures the correct extraction of the video segments in which each person appears. These segments are then passed to a Video Vision Transformer (ViViT) model, capable of simultaneously analyzing spatial and temporal information and determining the performed action.
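The final classification step could look roughly like the sketch below, which assumes that detection and tracking have already produced a per-person clip; the Kinetics-400 ViViT checkpoint and the 32-frame clip length are assumptions standing in for the project's own fine-tuned model.

```python
# Sketch: classify an already-extracted per-person clip with ViViT.
# Detection and tracking are assumed to have produced `person_frames`.
import numpy as np
from transformers import VivitImageProcessor, VivitForVideoClassification

processor = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
model = VivitForVideoClassification.from_pretrained("google/vivit-b-16x2-kinetics400")

# Placeholder clip: 32 RGB frames for one tracked person.
person_frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(32)]

inputs = processor(person_frames, return_tensors="pt")
logits = model(**inputs).logits
action = model.config.id2label[int(logits.argmax(-1))]
```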
D2.5 This activity presents the two methods developed for person re-identification: silhouette analysis and facial feature analysis. In the first case, the approach allows flexible alignment of body regions and produces more robust representations, useful in real surveillance scenarios where detections are not perfectly aligned. Several feature vector sizes were tested experimentally, and the optimal variant was selected to balance accuracy against the storage required for descriptors generated in real time. For face recognition, a well-established model from the literature was integrated, which unifies detection, alignment, and facial feature extraction. It ensures high performance in uncontrolled conditions, making it suitable for operational scenarios where the quality and stability of results are essential.
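Once descriptors are extracted, re-identification reduces to ranking gallery entries by similarity to the query descriptor. The sketch below illustrates this with cosine similarity and an assumed 512-dimensional vector size; the actual embedding extractors and the descriptor length selected in the project are not reproduced here.

```python
# Sketch: rank stored descriptors against a query descriptor by cosine
# similarity. The 512-dimensional size is an assumption for illustration.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

gallery = {f"track_{i}": np.random.rand(512) for i in range(3)}  # stored descriptors
query = np.random.rand(512)                                      # descriptor of the query image

ranked = sorted(gallery.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
print(ranked[0][0])  # best-matching gallery identity
```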
D2.6 This section describes the service developed for person re-identification in the audio domain, designed as a modular, containerized system that combines voice activity detection with a vocal feature extraction model and a set of REST endpoints for enrollment, verification, and re-identification. For preprocessing, the service integrates two VAD models, SpeechBrain and Silero, which can be enabled individually depending on the scenario, at the user's request.
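The Silero preprocessing path could, for example, be invoked as in the sketch below; the model is loaded from the public torch.hub repository, the audio file name is a placeholder, and the subsequent speaker-embedding and REST layers are omitted.

```python
# Sketch: Silero VAD preprocessing before vocal feature extraction.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio("recording.wav", sampling_rate=16000)  # placeholder file
speech_segments = get_speech_timestamps(wav, model, sampling_rate=16000)

# Only the detected speech segments would be forwarded to the vocal feature
# extractor used for enrollment / verification / re-identification.
print(speech_segments)
```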
D2.7 The solution proposed within this activity integrates a complete vehicle recognition system, capable of identifying both the body color and the make, model, and generation, using a custom dataset adapted to Romanian traffic. The approach starts from the observation that generalist models trained on international data fail to capture the particularities of the local fleet or the specific variations of footage captured in traffic, which is why building a proprietary database was necessary. The system combines a robust detector for identifying vehicles in the scene with a semantic segmentation module that filters the background and extracts the relevant areas for the two tasks. For color recognition, only the body panels are retained, which reduces ambiguities caused by reflections or non-uniform components. For make and model recognition, the system focuses on the front part, where the differences between generations are most visible. This guided filtering ensures clean and stable inputs for classification, improving performance in difficult scenarios.
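The sketch below illustrates only the masking idea behind the color branch: once the segmentation module has produced a panel mask for a detected vehicle crop, the color estimate is computed exclusively over those pixels. The mean-RGB computation is a deliberate simplification; the project's detector, segmenter, and color/make-model classifiers are not reproduced.

```python
# Sketch: restrict color estimation to the segmented body panels of a
# detected vehicle crop. Mean RGB stands in for the actual color classifier.
import numpy as np

def dominant_body_color(crop_rgb: np.ndarray, panel_mask: np.ndarray) -> np.ndarray:
    """Mean RGB over pixels marked as body panels; background, windows, and
    wheels are excluded by the mask."""
    panel_pixels = crop_rgb[panel_mask.astype(bool)]
    return panel_pixels.mean(axis=0)

crop = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)  # detected vehicle crop
mask = np.zeros((256, 256), dtype=np.uint8)                      # placeholder panel mask
mask[100:200, 50:200] = 1
print(dominant_body_color(crop, mask))
```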
D2.8 The model proposed for this activity adopts a two-stage architecture for image geolocation, in which a DETR module is used to isolate the building as the only relevant element of the scene, and a Vision Transformer model subsequently processes only this filtered region to extract stable architectural features. By eliminating variable urban context and unwanted correlations (such as the presence of vehicles, traffic signs, or other traffic-specific elements), the DETR preprocessing stage reshapes the visual distribution of the data and provides a clear structural signal focused on the geometry and composition of the facade. In this framework, the Vision Transformer becomes a building classifier operating on a coherent, standardized representation, which maximizes separability between points of interest and minimizes the effects of external factors, thus ensuring robust and controlled learning of architectural specificity.
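A minimal sketch of this two-stage flow is given below: the building box returned by DETR is cropped, and only that region is passed to the ViT classifier. The public checkpoints are stand-ins for the project's fine-tuned models, and the sketch assumes at least one detection above the threshold.

```python
# Sketch: crop the DETR-detected building region, then classify the crop
# with ViT. Checkpoints and file name are generic placeholders.
import torch
from PIL import Image
from transformers import (DetrImageProcessor, DetrForObjectDetection,
                          ViTImageProcessor, ViTForImageClassification)

image = Image.open("street_view.jpg")

det_proc = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
det_out = detr(**det_proc(images=image, return_tensors="pt"))
boxes = det_proc.post_process_object_detection(
    det_out, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.7)[0]["boxes"]

# Assume the first box corresponds to the building of interest.
building_crop = image.crop(tuple(boxes[0].tolist()))

vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
poi_logits = vit(**vit_proc(images=building_crop, return_tensors="pt")).logits
```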
D2.9 This activity proposes, for relation extraction from Romanian texts, a pipeline organized around an algorithmic routing mechanism that combines, in a modular architecture, two complementary LLMs and a suite of pre- and post-processing components. At its core, the extraction task is performed by an encoder–decoder mT5 model, trained in seq2seq mode on a corpus dedicated to the written press domain. In parallel, a service based on RoGemma is used in few-shot mode, through controlled prompting, for scenarios where generative flexibility is advantageous. The flow is orchestrated by a Router / Gateway that decides, depending on context, the optimal route (mT5 vs. RoGemma) and the conditional application of a coreference module based on RoGemma, introduced to maintain referential coherence in long texts and to allow sentence-by-sentence processing within context window limits.
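The routing decision could be expressed roughly as in the sketch below; the length-based criterion and the fallback from mT5 to RoGemma are illustrative assumptions, and the two models plus the coreference module are represented as placeholder callables.

```python
# Sketch of a Router / Gateway decision; criteria and fallback order are
# illustrative assumptions, and the three services are placeholder callables.
from typing import Callable, List

def route_request(text: str,
                  mt5_extract: Callable[[str], List[dict]],
                  rogemma_extract: Callable[[str], List[dict]],
                  rogemma_coref: Callable[[str], str],
                  max_chars: int = 2000) -> List[dict]:
    # Long documents are first passed through the coreference module so that
    # each sentence can be processed independently within the context window.
    if len(text) > max_chars:
        text = rogemma_coref(text)
    # The fine-tuned mT5 model handles the standard press-domain route;
    # RoGemma (few-shot) is used here as the alternative route.
    relations = mt5_extract(text)
    return relations if relations else rogemma_extract(text)
```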
D2.10 The multimodal entity search and resolution module (MER) is designed as an integration layer on top of existing security algorithms, with the role of correlating their results based on details provided by the user (names, identifiers, textual descriptions) or from already recorded events. Access to the processed data is achieved through a unified protocol based on FastAPI web services, which expose endpoints for each processed data source. The responses are standardized in JSON format and may include, besides event metadata, algorithm-specific results (for example, scores and regions of interest for image/video re-identification or elements extracted through NER), thus facilitating the construction of an iterative search chain in which information obtained from one algorithm is propagated to the others.
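A hypothetical example of such a standardized response is sketched below; all field names are illustrative and do not reflect the project's actual schema.

```python
# Hypothetical shape of the standardized JSON returned by a per-source
# endpoint; every field name here is an illustrative assumption.
example_response = {
    "event": {"source_id": "camera_07", "timestamp": "2024-05-01T12:00:00Z"},
    "algorithm": "person_reid_silhouette",
    "results": [
        {"score": 0.87, "region_of_interest": [120, 40, 310, 400]},
    ],
    "entities": [  # e.g. elements extracted through NER from associated text
        {"type": "PERSON", "text": "Ion Popescu"},
    ],
}
```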
D2.11 The data source interconnection module functions as a unified multimodal access layer, allowing the algorithms developed in the project to coherently process video, audio, and textual streams regardless of their technical origin. The architecture is built to offer compatibility with sources found in real infrastructures, including IP cameras through standardized protocols (ONVIF, RTSP, RTP), local files or video resources managed by VMS/NVR systems, as well as common audio files (WAV, MP3, WMA) and text documents accessible locally or over the network. Furthermore, each developed algorithm will be delivered as a Docker container that integrates the data access component appropriate to its modality.
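As an example of uniform access on the video side, the sketch below opens either a local file or an RTSP stream through OpenCV; the camera address is a placeholder, and OpenCV is only one possible access backend.

```python
# Sketch: open a video source uniformly, whether it is a local file path or
# an RTSP URL. The camera address below is a documentation placeholder.
import cv2

def open_video_source(uri: str) -> cv2.VideoCapture:
    # cv2.VideoCapture handles both local paths and rtsp:// URLs.
    cap = cv2.VideoCapture(uri)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open video source: {uri}")
    return cap

cap = open_video_source("rtsp://192.0.2.10:554/stream1")  # placeholder address
ok, frame = cap.read()
```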
D2.12 This activity proposes the formalization of the steps for integrating the algorithms into the final platform. Three private GitHub repositories were created, corresponding to the responsibilities of each partner, and constitute the basis for distributing and testing the software modules described in the previous deliverables. This integration model demonstrates how the algorithms can be containerized, distributed, and later integrated into external applications or infrastructures, ensuring a uniform access protocol, interoperability between services, and reproducibility in testing and validation.
D2.13 In this stage of the project, a functional module for person re-identification based on silhouette features was integrated, allowing the user to generate a gallery of recordings, subsequently select a query image, and obtain, through a standardized configuration and execution process, the list of instances in which the person is detected. The interface of the integrated application allows authentication and rapid configuration of the data sources that will be processed by the algorithms developed in the project. After authentication, the user can define video, audio, or text instances by selecting either local files or network resources, each instance then being assignable to a processing algorithm.
D2.14 In this activity, each partner provided the beneficiary with the current versions of the AI algorithms through private GitHub repositories, ensuring a transparent evaluation flow and a unified structure for reproducing results, including source code, execution instructions, virtual environment specifications, and links to external resources. The activity was accompanied by regular meetings with the beneficiary, in which the necessary technical adaptations were discussed, as well as by continuous communication through issues opened in the repositories, facilitating the tracking of feedback and changes. In the next stage, the testing and validation process will continue in the same manner, with emphasis on performance optimization and full integration of the algorithms. At the end of the project, all components will be delivered as containerized web services, together with the Dockerfile and docker-compose files that allow them to be executed.