Useful functions

Reading Spectra from a File

The ms_entropy package provides the read_one_spectrum function for reading spectra from a file. Here is an example of how you can use it:

from ms_entropy import read_one_spectrum
for spectrum in read_one_spectrum('path/to/spectrum/file'):
    print(spectrum)

Each iteration yields one spectrum as a dictionary, in which each key-value pair holds one piece of the spectrum's metadata.

Currently, read_one_spectrum supports the following file formats: .mgf, .msp, .mzML, and .lbm2 (from the MS-DIAL software).


Get the top-N results from the Flash entropy search results

After searching your spectral library, you may want to keep only the top-N results, or only the results whose similarity score is above a certain threshold. The get_topn_matches function is designed for this purpose.

The get_topn_matches function takes three parameters:

  • similarity_array: The similarity scores that the search function has returned.

  • topn: The number of top results you want to retrieve. If you set this to None, the number of results is not limited.

  • min_similarity: The minimum similarity score that results should have. If you set this to None, no similarity threshold is applied.

The function returns a list of dictionaries, one for each matching spectrum in the library. Each dictionary is similar to the corresponding library spectrum (the input of build_index), with an added entropy_similarity key that stores the similarity score of that spectrum.

Here’s how you can use the get_topn_matches function:

topn_match = entropy_search.get_topn_matches(entropy_similarity, topn=3, min_similarity=0.01)

This example will return a list of the top 3 matches with a similarity score greater than 0.01.
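
To make the interaction between topn and min_similarity concrete, here is a minimal pure-Python sketch of the selection logic described above. The helper select_top_matches is hypothetical and is not part of ms_entropy. It returns (index, score) pairs rather than spectrum dictionaries, and the strict "greater than" comparison is an assumption taken from the sentence above.

```python
def select_top_matches(similarities, topn=None, min_similarity=None):
    """Return (library_index, score) pairs, best first.

    Hypothetical helper for illustration only; not the ms_entropy implementation.
    """
    # Pair every score with its position in the library.
    scored = list(enumerate(similarities))
    # Drop scores at or below the threshold, unless no threshold is given.
    if min_similarity is not None:
        scored = [(i, s) for i, s in scored if s > min_similarity]
    # Sort by score, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep only the first topn entries, unless topn is None.
    if topn is not None:
        scored = scored[:topn]
    return scored


# Made-up similarity scores for a five-spectrum library.
scores = [0.0, 0.85, 0.005, 0.42, 0.97]
top3 = select_top_matches(scores, topn=3, min_similarity=0.01)
print(top3)
```

With both arguments left as None, the sketch returns every entry, sorted by score.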


Get the metadata of a specific spectrum from the Flash entropy search object

After a search, you may want to retrieve the metadata of a specific library spectrum. The FlashEntropySearch object supports this through its __getitem__ method, so you can index the object directly.

For instance, suppose the search shows that the third spectrum in the library has the highest similarity score. Because indices start from 0, calling entropy_search[2] returns the metadata of that spectrum.

Here’s an example of how you can use the __getitem__ function:

from ms_entropy import FlashEntropySearch
entropy_search = FlashEntropySearch()
entropy_search.build_index(spectral_library)

# Get the metadata of the third spectrum
metadata = entropy_search[2]

The metadata is extracted and stored when you call the build_index function. It remains available even if you save and reload the index, using either the pickle module or the read and write functions.
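
For readers unfamiliar with __getitem__, the following toy class sketches how square-bracket indexing maps onto that method. MetadataStore is a hypothetical stand-in and not the FlashEntropySearch implementation; in particular, treating every field except "peaks" as metadata is an assumption of this sketch.

```python
class MetadataStore:
    """Toy container (hypothetical) that keeps per-spectrum metadata."""

    def __init__(self, spectra):
        # Assumption for this sketch: everything except the peaks is metadata.
        self._metadata = [
            {key: value for key, value in spectrum.items() if key != "peaks"}
            for spectrum in spectra
        ]

    def __getitem__(self, index):
        # Python calls this method whenever store[index] is evaluated.
        return self._metadata[index]


store = MetadataStore([
    {"id": "Demo spectrum 1", "precursor_mz": 150.0, "peaks": [[100.0, 1.0]]},
    {"id": "Demo spectrum 2", "precursor_mz": 200.0, "peaks": [[101.0, 1.0]]},
    {"id": "Demo spectrum 3", "precursor_mz": 250.0, "peaks": [[102.0, 1.0]]},
])

third = store[2]  # Equivalent to store.__getitem__(2)
print(third)
```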


Get the number of matched peaks between the query spectrum and the library spectra

If you also want to know the number of matched peaks between the query spectrum and each library spectrum, set the output_matched_peak_number parameter to True. The search then returns two numpy arrays: the first contains the similarity scores, and the second contains the number of matched peaks.

At the moment, the output_matched_peak_number parameter is supported only by the identity_search, open_search, and neutral_loss_search functions.

Here’s an example of how you can use the output_matched_peak_number parameter:

import numpy as np
from ms_entropy import FlashEntropySearch

spectral_library = [{
    "id": "Demo spectrum 1",
    "precursor_mz": 150.0,
    "peaks": [[100.0, 1.0], [101.0, 1.0], [103.0, 1.0]]
}, {
    "id": "Demo spectrum 2",
    "precursor_mz": 200.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32),
    "metadata": "ABC"
}, {
    "id": "Demo spectrum 3",
    "precursor_mz": 250.0,
    "peaks": np.array([[200.0, 1.0], [101.0, 1.0], [202.0, 1.0]], dtype=np.float32),
    "XXX": "YYY",
}, {
    "precursor_mz": 350.0,
    "peaks": [[100.0, 1.0], [101.0, 1.0], [302.0, 1.0]]}]
query_spectrum = {"precursor_mz": 150.0,
                  "peaks": [[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]]}

entropy_search = FlashEntropySearch()
# Step 1: Build the index from the library spectra
spectral_library = entropy_search.build_index(spectral_library)
# Step 2: Clean the query spectrum
query_spectrum['peaks'] = entropy_search.clean_spectrum_for_search(
    precursor_mz = query_spectrum['precursor_mz'],
    peaks = query_spectrum['peaks']
)
# Step 3: Search the library
# This parameter is supported by the identity_search, open_search, and neutral_loss_search functions
entropy_similarity, matched_peaks_number = entropy_search.identity_search(
    precursor_mz = query_spectrum['precursor_mz'],
    peaks = query_spectrum['peaks'],
    ms1_tolerance_in_da = 0.01,
    ms2_tolerance_in_da = 0.02,
    output_matched_peak_number = True
)
print(entropy_similarity)
print(matched_peaks_number)
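
One possible use of the second array is to discard hits supported by too few matched peaks. The sketch below uses plain lists with made-up values in place of the two returned arrays; the helper filter_by_matched_peaks is hypothetical and not part of ms_entropy.

```python
def filter_by_matched_peaks(scores, counts, min_peaks):
    """Return (index, score, count) for entries with at least min_peaks matched peaks."""
    return [
        (index, score, count)
        for index, (score, count) in enumerate(zip(scores, counts))
        if count >= min_peaks
    ]


# Made-up values for illustration; these are not the output of the search above.
example_scores = [0.91, 0.35, 0.0, 0.62]
example_counts = [3, 1, 0, 2]

hits = filter_by_matched_peaks(example_scores, example_counts, min_peaks=2)
print(hits)
```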

Save and load the index of the Flash entropy search object

After you have built the index, you have the option to save it to disk for later use.

Using pickle

You can use Python’s built-in pickle module to save and load the FlashEntropySearch object, as follows:

import pickle
# Save the index
with open('path/to/index', 'wb') as f:
    pickle.dump(entropy_search, f)
# And load the index
with open('path/to/index', 'rb') as f:
    entropy_search = pickle.load(f)
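
The save and load steps above can be combined into a small "load if present, otherwise build and save" helper. This is a minimal sketch using only the standard library; load_or_build and build_demo_index are hypothetical names, and a plain dictionary stands in for the FlashEntropySearch object. Note that pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile


def load_or_build(index_path, build_fn):
    """Load a pickled object from index_path, or build it and save it first."""
    if os.path.exists(index_path):
        with open(index_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(index_path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Demonstration: a plain dictionary stands in for a built index.
build_calls = []


def build_demo_index():
    build_calls.append(1)  # Record how often the expensive build runs.
    return {"n_spectra": 4}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_index.pkl")
    first = load_or_build(path, build_demo_index)   # Builds and saves.
    second = load_or_build(path, build_demo_index)  # Loads from disk.

print(first == second, len(build_calls))
```

In the demonstration, the build function runs only on the first call; the second call reads the saved file instead.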

Using read and write functions

The FlashEntropySearch object also provides read and write functions to load and save the index.

To save a FlashEntropySearch object to disk:

entropy_search.write('path/to/index')

To load a FlashEntropySearch object from disk:

entropy_search = FlashEntropySearch()
entropy_search.read('path/to/index')

If you’re working with a very large spectral library, or your computer’s memory is limited, you can use the low_memory parameter to partially load the library and reduce the memory usage. For example:

entropy_search = FlashEntropySearch(low_memory=True)
entropy_search.read('path/to/index')

The index only needs to be built once. After that, you can load it with the read function. An index built in low_memory=False mode can be loaded by a FlashEntropySearch object in either low_memory=False or low_memory=True mode.