SBIR-STTR Award

Log-driven Infrastructure Analytics and Management (LIAM)
Award last edited on: 1/5/2023

Sponsored Program
SBIR
Awarding Agency
DOE
Total Award Amount
$1,900,000
Award Phase
2
Solicitation Topic Code
C51-05b
Principal Investigator
Partha Bhaumik

Company Information

Ennetix Inc (AKA: Putah Green Solutions)

1477 Drew Avenue Suite 106
Davis, CA 95618
   (530) 574-7084
   info@ennetix.com
   www.ennetix.com
Location: Multiple
Congr. District: 03
County: Yolo

Phase I

Contract Number: DE-SC0021575
Start Date: 2/22/2021    Completed: 10/21/2021
Phase I year
2021
Phase I Amount
$250,000
The ubiquity of cloud-delivered applications and services and the always-on nature of personal and business communications have driven data traffic to grow at unprecedented rates and created virtualized, dynamic, and distributed application-delivery infrastructures. Assuring availability, security, and performance in such an environment poses a real challenge to IT departments. Therefore, traditional IT Ops has given way to DevOps to speed up IT’s service response to rapidly changing demands from its stakeholders. The rate of configuration changes, which include software updates, in a DevOps environment is by design an order of magnitude greater than in a traditional IT Ops environment. Now, IT organizations are trying to leverage machine learning and advanced analytics to further automate and improve the responsiveness of infrastructure services. The impetus for configuration changes originates not just from stakeholders, but also from the increasing use of software-defined elements in the infrastructure. This new trend is referred to as Algorithmic IT Ops (AIOps). An environment that uses automated provisioning and software-defined orchestration cannot ignore the impact of frequent configuration changes/updates (manifested in system/server logs) on application infrastructure performance. Additionally, in a distributed infrastructure, third-party-managed services have a significant impact on application and network performance. Thus, it becomes imperative to understand what relevant events are occurring outside the enterprise’s traditional infrastructure boundaries and how those events affect its ability to meet its performance objectives. Information provided by non-traditional, textual data sources (e.g., API logs, outage updates, emails, and incident reports) that reflect outages and issues on third-party-managed infrastructures becomes critical in infrastructure performance analytics.
Today’s performance-management tools primarily use numerical network-traffic-related data and limited textual data, such as syslogs, in silos. Mining pertinent information from textual log/event data and correlating it with numerical performance data on the same analytics platform will lead to much faster troubleshooting of application/service infrastructure performance issues. Considering these realities, in this Phase I SBIR project, Ennetix will develop a novel, log-driven infrastructure analytics and management service, called LIAM, to enhance availability, security, and performance of distributed infrastructures, and greatly accelerate root-cause analysis of infrastructure problems. LIAM will mine non-traditional textual data, such as system/server logs, configuration change logs, outage reports, and event reports from other IT service management products, and correlate them with numerical network trace and server/host performance data. LIAM will feature advanced machine-learning techniques based on topic mining, novelty detection, and clustering, and it will be built on a scalable architecture to accommodate other user-defined categorical data sources. LIAM will bring useful additional context to the analysis of performance anomalies to reduce application/service interruptions and accelerate root-cause identification and service restoration. The proposed solution will greatly benefit IT administrators and managers at DOE and other government organizations through a new approach to infrastructure performance management in today’s cloud-based, dynamic, and distributed IT infrastructures. The wider benefits of this effort will extend well beyond the immediate DOE scientific community to other enterprises, network operators, and cloud-service providers.
In particular, many digital enterprises and commercial cloud-service providers can leverage the proposed service to proactively troubleshoot performance issues for their distributed application/service delivery infrastructures.
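The correlation of textual log events with numerical performance data described in this abstract can be illustrated with a minimal sketch. It is not the LIAM implementation, which the abstract does not detail: the z-score anomaly test, the five-minute look-back window, the function names `find_anomalies` and `correlate`, and the sample latency values and log messages are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def find_anomalies(samples, threshold=2.0):
    """Return timestamps whose value lies more than `threshold`
    sample standard deviations from the mean (a simple z-score test)."""
    values = [value for _, value in samples]
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [ts for ts, value in samples if abs(value - mu) / sigma > threshold]


def correlate(anomaly_times, log_events, window=timedelta(minutes=5)):
    """Map each anomaly timestamp to the log messages recorded
    within `window` before it (hypothetical look-back heuristic)."""
    return {
        t: [msg for ts, msg in log_events if timedelta(0) <= t - ts <= window]
        for t in anomaly_times
    }


# Illustrative data only: one latency sample per minute and two log events.
base = datetime(2021, 3, 1, 12, 0)
latency_ms = [
    (base + timedelta(minutes=i), v)
    for i, v in enumerate([20, 21, 19, 22, 20, 21, 95, 20, 19, 21])
]
log_events = [
    (base + timedelta(minutes=4), "config change: firewall rule updated"),
    (base + timedelta(minutes=8), "cron: log rotation completed"),
]

anomalies = find_anomalies(latency_ms)
context = correlate(anomalies, log_events)
for t, messages in context.items():
    print(t.isoformat(), messages)
```

Here the latency spike at minute 6 is paired with the configuration change logged two minutes earlier, while the later log-rotation event is ignored because it follows the anomaly. A production system would need a more robust anomaly detector and a way to rank candidate events.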

Phase II

Contract Number: DE-SC0021575
Start Date: 4/4/2022    Completed: 4/3/2024
Phase II year
2022
Phase II Amount
$1,650,000
Digital transformation of enterprises and the emergence of cloud-delivered applications and services have created virtualized, dynamic, and distributed IT infrastructures. Assuring availability, security, and performance in such an environment poses a real challenge to IT departments. Traditional IT Ops has given way to DevOps to speed up IT’s service response to rapidly changing demands from its stakeholders. The rate of configuration changes in a DevOps environment is an order of magnitude greater than in a traditional IT Ops environment. Now, IT organizations are trying to leverage machine learning and advanced analytics to further automate and improve the responsiveness of infrastructure services. This new trend is referred to as Artificial Intelligence for IT Ops (AIOps). A DevOps environment that uses automated provisioning and software-defined orchestration cannot ignore the impact of frequent configuration changes/updates (manifested in system/server logs) on application infrastructure performance. Information provided by non-traditional, textual data sources (e.g., syslogs, API logs, and outage reports) that reflect issues on infrastructures becomes critical in infrastructure performance analytics. Today’s performance-management tools primarily use numerical network-traffic-related data and limited textual data, such as syslogs, in silos. Mining pertinent information from textual log/event data and correlating it with numerical performance data on the same analytics platform will lead to faster troubleshooting of application/service infrastructure performance issues. Considering these realities, in this Phase II SBIR project, Ennetix will develop a novel, log-driven infrastructure analytics and management service, called LIAM, to enhance availability, security, and performance of modern IT infrastructures, and greatly accelerate root-cause analysis of issues.
LIAM will mine non-traditional textual data, such as system/server logs, configuration change logs, outage reports, and event reports from other IT management platforms, and correlate them with numerical network trace and server/host performance data. LIAM will feature advanced machine-learning techniques based on topic mining, novelty detection, and clustering, and it will be built on a scalable architecture to accommodate other user-defined categorical data sources. LIAM will bring useful additional context to the analysis of performance anomalies to reduce application/service interruptions and accelerate root-cause identification and service restoration. During Phase I of this SBIR project, requirements analysis and design of the LIAM platform were conducted, a working prototype was developed, and evaluation studies were performed to determine LIAM’s effectiveness in supporting IT operations through faster root-cause analytics and troubleshooting of modern IT infrastructures. These feasibility and performance evaluation studies were accomplished using live data gathered from a large campus IT infrastructure (namely, UC Davis). Outcomes of the Phase I R&D efforts and evaluation studies have confirmed the viability of LIAM as a commercial-grade solution. In this Phase II project (as a continuation of Phase I), the goal is to significantly expand LIAM with analytical features, AI/ML models, third-party integrations, automation methods, and innovative visualizations. A commercial-grade LIAM solution will be developed with which IT operations teams can proactively manage the performance of distributed infrastructures. Early trials will be conducted to demonstrate the functionality and performance of LIAM on live networks and pave the way to successful market entry and deployment at premier R&E organizations such as UC Davis.
The proposed solution will greatly benefit IT administrators and managers at DOE and other organizations through a new approach for IT management which considers various data sources (both textual and numerical) along with traffic data and significantly reduces operational expenditures. The wider benefits of this effort will extend well beyond the immediate DOE scientific community, and on to other enterprises, network operators, and cloud-service providers, who will be able to leverage the proposed LIAM solution to proactively manage their cloud-based, distributed, and dynamic application-delivery infrastructures.
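The abstract names novelty detection over log data as one of LIAM’s techniques without describing how it is done. One simple stand-in for the idea, offered only as a sketch, is to mask variable fields in each log line and flag lines whose resulting template was never seen in a baseline. The masking rules, the function names `to_template` and `novel_lines`, and the sample log lines are hypothetical and are not taken from the project.

```python
import re


def to_template(line):
    """Mask variable tokens so that lines differing only in addresses
    or numbers collapse to the same template (assumed masking rules)."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line


def novel_lines(baseline, incoming):
    """Return incoming lines whose template does not occur in the baseline."""
    known = {to_template(line) for line in baseline}
    return [line for line in incoming if to_template(line) not in known]


# Illustrative log lines only.
baseline = [
    "Accepted connection from 10.0.0.5 port 5123",
    "Accepted connection from 10.0.0.9 port 6001",
    "Session 42 closed",
]
incoming = [
    "Accepted connection from 192.168.1.7 port 2200",
    "Session 77 closed",
    "BGP neighbor 10.1.1.1 went down",
]

flagged = novel_lines(baseline, incoming)
print(flagged)
```

Only the BGP message is flagged, because the other two incoming lines match templates already present in the baseline. Template matching of this kind is far cruder than the machine-learning methods the abstract refers to, but it shows why textual logs can surface events that numerical metrics alone would miss.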