Software identification is an important capability, especially in larger organizations where a multi-vendor environment that grows over time can rapidly become tangled and interconnected in sometimes unexpected ways. Keeping accurate documentation helps, but isn’t always foolproof. IT managers need a way to figure out which versions of which software products are running in their environment. Correct identification is necessary for cybersecurity, software management and modernization planning, among other critical matters.
There are several widely used methods for identifying installed software packages, most notably Common Platform Enumeration (CPE) names, Package URLs (PURL) and Software Identification (SWID) tags. This variety is a mixed blessing; the lack of a single common standard among vendors – open-source and commercial providers alike – makes identification harder.
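To make the difference between identifier schemes concrete, here is a minimal Python sketch that splits a CPE 2.3 formatted string and a simple Package URL into their parts. The function names `parse_cpe23` and `parse_purl` are hypothetical, the example identifiers are illustrative only, and both parsers are deliberately simplified (no escaped characters, qualifiers or percent-decoding), so treat this as an illustration rather than a conforming implementation.

```python
# Attribute names of a CPE 2.3 formatted string, in the order they appear.
CPE_FIELDS = ["part", "vendor", "product", "version", "update", "edition",
              "language", "sw_edition", "target_sw", "target_hw", "other"]


def parse_cpe23(cpe: str) -> dict:
    """Split a CPE 2.3 formatted string into named attributes.

    Simplification (assumption): no escaped colons appear inside any value.
    """
    pieces = cpe.split(":")
    if len(pieces) != 13 or pieces[0] != "cpe" or pieces[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    return dict(zip(CPE_FIELDS, pieces[2:]))


def parse_purl(purl: str) -> dict:
    """Split a simple Package URL: pkg:type/namespace/name@version.

    Simplification (assumption): qualifiers (?...) and subpath (#...) are
    dropped, and no percent-decoding is performed.
    """
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a Package URL: {purl!r}")
    body = purl[4:].split("#", 1)[0].split("?", 1)[0]
    if "@" in body:
        path, _, version = body.rpartition("@")
    else:
        path, version = body, None
    parts = path.split("/")
    return {
        "type": parts[0],
        "namespace": "/".join(parts[1:-1]) or None,
        "name": parts[-1],
        "version": version,
    }


# The same product release, named two different ways (illustrative values).
cpe = parse_cpe23("cpe:2.3:a:apache:log4j:2.14.1:*:*:*:*:*:*:*")
purl = parse_purl("pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1")
print(cpe["vendor"], cpe["product"], cpe["version"])
print(purl["namespace"], purl["name"], purl["version"])
```

Even in this small example, the two schemes disagree on the vendor and product names for one release, which is why tools that use different identifiers are hard to reconcile.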
From our perspective, the key is for an organization to choose software asset inventory tools and vulnerability tools that are interoperable in terms of software product identification. We also believe artificial intelligence, and particularly machine learning, holds tremendous promise to accelerate and enable accurate software identification. Current methods that entail manually mapping software inventory data to a standardized list of software products are inefficient and susceptible to errors. These limitations stem from a range of issues, including the varied naming conventions used by different vendors and the difficulty of scaling manual work to large data volumes.
GMU students find hybrid machine-learning approach
Over the 2023-2024 academic year, we sponsored a cohort of George Mason University's Cybersecurity Engineering seniors to develop machine learning models that correlate software inventory data gathered from several cybersecurity tools with a pre-defined dictionary of known software products. We provided the GMU students with an Amazon SageMaker environment in which to develop these machine learning techniques.
The students used large datasets, including a curated dictionary of software data that is representative of the curated data used within multiple Federal cybersecurity programs. They took time to thoroughly study the datasets provided by CGI, which included software inventory data gathered from several cybersecurity tools. Ultimately, the students determined that a hybrid approach was the most effective: machine learning blended with a non-machine learning method that uses fuzzy matching logic to winnow a list down to a single match.
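The general shape of such a hybrid approach can be sketched in a few lines of Python. This is not the students' implementation: the character-trigram cosine similarity below is only a stand-in for the machine learning stage, the standard library's `difflib.SequenceMatcher` is a stand-in for the fuzzy matching logic, and every name, the shortlist size `k`, the `threshold` value and the sample dictionary entries are hypothetical.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, keep alphanumeric tokens, and sort them so word order is ignored."""
    return " ".join(sorted(re.findall(r"[a-z0-9]+", name.lower())))


def trigram_vector(name: str) -> Counter:
    """Count the character trigrams of the normalized, space-padded name."""
    text = f" {normalize(name)} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def shortlist(raw_name: str, dictionary: list, k: int = 3) -> list:
    """Stage 1 (stand-in for the ML model): keep the k most similar entries."""
    query = trigram_vector(raw_name)
    ranked = sorted(dictionary,
                    key=lambda entry: cosine(query, trigram_vector(entry)),
                    reverse=True)
    return ranked[:k]


def best_match(raw_name: str, dictionary: list, k: int = 3, threshold: float = 0.6):
    """Stage 2: fuzzy-match the shortlist down to one entry, or None if too weak."""
    target = normalize(raw_name)
    scored = [(SequenceMatcher(None, target, normalize(entry)).ratio(), entry)
              for entry in shortlist(raw_name, dictionary, k)]
    if not scored:
        return None
    score, winner = max(scored)
    return winner if score >= threshold else None


# Hypothetical dictionary entries and tool-reported names, for illustration only.
products = ["Mozilla Firefox", "Google Chrome", "Microsoft Edge", "Adobe Acrobat Reader"]
print(best_match("firefox (mozilla) x64", products))
print(best_match("Google Chrome 120", products))
print(best_match("Totally Unknown Tool", products))
```

Splitting the work this way keeps the expensive pairwise fuzzy comparison confined to a handful of candidates per record, while the threshold lets weak matches fall through for manual review instead of being forced onto the wrong product.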
The capstone project [link to research paper] showed great promise as a potential solution to the software identification challenge. It represents a significant step toward using machine learning to manage software inventory data more efficiently and effectively. By combining the unsupervised learning model with the fuzzy matching logic, the students achieved a high match rate between cybersecurity tool data and a standardized software product list. This project is just a stepping stone; further improvements can be explored to enhance software identification in our evolving cybersecurity landscape.
To learn more about the project and take a deeper dive into the methodology, tools, and processes, read our white paper.