Basic usage

Flash Entropy Search is an efficient algorithm for quickly searching large spectral libraries. For details, see the reference: Li, Y., Fiehn, O., Flash entropy search to query all mass spectral libraries in real time. 04 April 2023, PREPRINT (Version 1) available at Research Square.

Step 0: Prepare the library spectra

Before you start, you need to prepare your library spectra. This data should be represented as a list of dictionaries, where each dictionary corresponds to a spectrum. Below is an example demonstrating the appropriate format for the library spectra:

import numpy as np
spectral_library = [{
    "id": "Demo spectrum 1",
    "precursor_mz": 150.0,
    "peaks": [[100.0, 1.0], [101.0, 1.0], [103.0, 1.0]]
}, {
    "id": "Demo spectrum 2",
    "precursor_mz": 200.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32),
    "metadata": "ABC"
}, {
    "id": "Demo spectrum 3",
    "precursor_mz": 250.0,
    "peaks": np.array([[200.0, 1.0], [101.0, 1.0], [202.0, 1.0]], dtype=np.float32),
    "XXX": "YYY",
}, {
    "precursor_mz": 350.0,
    "peaks": [[100.0, 1.0], [101.0, 1.0], [302.0, 1.0]]}]

In this format, each dictionary must contain the keys precursor_mz and peaks. The precursor_mz key represents the precursor m/z, which should be a float. The peaks key represents the peaks of the spectrum, which can be a list of lists or a numpy array in the format [[mz, intensity], [mz, intensity], ...].

All other keys are optional and can be used to store any metadata associated with the spectrum. For instance, the id key could be used to identify the spectrum, while other keys could hold additional information.
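A quick pre-flight check can catch malformed entries before indexing. The sketch below is a hypothetical helper (check_library_spectrum is not part of ms_entropy) that checks only the two required keys and the [[mz, intensity], ...] shape described above:

```python
def check_library_spectrum(spectrum):
    """Hypothetical helper, not part of ms_entropy: list the problems found."""
    problems = []
    # Both keys are required by the format described above.
    for key in ("precursor_mz", "peaks"):
        if key not in spectrum:
            problems.append(f"missing required key: {key}")
    # precursor_mz must be convertible to a float.
    if "precursor_mz" in spectrum:
        try:
            float(spectrum["precursor_mz"])
        except (TypeError, ValueError):
            problems.append("precursor_mz is not a number")
    # Every peak must be an [mz, intensity] pair (list rows or numpy rows).
    if "peaks" in spectrum:
        for peak in spectrum["peaks"]:
            if len(peak) != 2:
                problems.append("each peak must be [mz, intensity]")
                break
    return problems
```

An empty list means the entry has the expected shape; it says nothing about whether the values are chemically sensible.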

Step 1: Building the index

The first step in using the Flash Entropy Search algorithm is to build an index for the library spectra with the build_index method of the FlashEntropySearch class. This method takes your spectral_library as an argument. The spectra in the spectral_library do not require any pre-processing, as the index building process handles this automatically.

from ms_entropy import FlashEntropySearch
entropy_search = FlashEntropySearch()
entropy_search.build_index(spectral_library)

Warning

To make identity searching efficient, all spectra in the spectral_library are re-sorted by precursor m/z during indexing. The build_index function returns a list of these re-sorted spectra, which you can use to map results back to their original order.

If you need to maintain the original order of spectra, consider adding metadata, such as an id field, to your spectra. This metadata will help keep track of the original sequence even after re-sorting.
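One way to keep track of the original sequence is sketched below. The original_index key is a name chosen for this illustration, the scores are made up, and sorted() merely stands in for the re-sorted list (assumed here to be in ascending precursor m/z); in real use, take the list that build_index returns instead.

```python
library = [
    {"id": "B", "precursor_mz": 250.0, "peaks": [[100.0, 1.0]]},
    {"id": "A", "precursor_mz": 150.0, "peaks": [[100.0, 1.0]]},
    {"id": "C", "precursor_mz": 200.0, "peaks": [[100.0, 1.0]]},
]

# Tag every spectrum with its original position before indexing.
for position, spectrum in enumerate(library):
    spectrum["original_index"] = position

# Stand-in for the re-sorted list returned by build_index (assumption).
resorted = sorted(library, key=lambda s: s["precursor_mz"])

# Scores follow the re-sorted order; these values are invented.
scores = [0.9, 0.1, 0.4]

# Rearrange the scores into the original library order.
scores_in_original_order = [None] * len(library)
for spectrum, score in zip(resorted, scores):
    scores_in_original_order[spectrum["original_index"]] = score
```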

Note

If you want to calculate unweighted entropy similarity instead of weighted entropy similarity, you can set the intensity_weight parameter to None when constructing the FlashEntropySearch object. For example:

entropy_search = FlashEntropySearch(intensity_weight=None)
entropy_search.build_index(spectral_library)
...
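For intuition about what the unweighted score measures, here is a minimal sketch assuming the usual definition of entropy similarity, 1 - (2 * S_AB - S_A - S_B) / ln(4), where S is the Shannon entropy of a spectrum's normalized intensities and AB is the 1:1 mixture of the two spectra. The function names are hypothetical, and peak matching is simplified to exact m/z equality rather than a mass tolerance, so this is not how the library computes its scores.

```python
import math

def spectral_entropy(intensities):
    """Shannon entropy (natural log) of intensities normalized to sum to 1."""
    total = sum(intensities)
    return -sum((i / total) * math.log(i / total) for i in intensities if i > 0)

def unweighted_entropy_similarity(peaks_a, peaks_b):
    """Illustrative only: peaks match when their m/z values are exactly equal."""
    a = {mz: intensity for mz, intensity in peaks_a}
    b = {mz: intensity for mz, intensity in peaks_b}
    sum_a, sum_b = sum(a.values()), sum(b.values())
    # Mix the two normalized spectra 1:1, adding intensities of matched peaks.
    mixed = {}
    for mz, intensity in a.items():
        mixed[mz] = mixed.get(mz, 0.0) + 0.5 * intensity / sum_a
    for mz, intensity in b.items():
        mixed[mz] = mixed.get(mz, 0.0) + 0.5 * intensity / sum_b
    s_a = spectral_entropy(a.values())
    s_b = spectral_entropy(b.values())
    s_ab = spectral_entropy(mixed.values())
    return 1.0 - (2.0 * s_ab - s_a - s_b) / math.log(4.0)
```

Under this definition, identical spectra score 1 and spectra with no shared peaks score 0.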

Step 2: Searching the library

After building the index, you can search your query spectrum against the library spectra using the search function. This function takes the query spectrum as input, returning the similarity score of each spectrum in the library, in the same order as the spectral library returned by the build_index function.

The search function accepts the following parameters:

  • precursor_mz: The precursor m/z of the query spectrum.

  • peaks: The peaks of the query spectrum, either a list of lists or a numpy array, in the format [[mz, intensity], [mz, intensity], ...].

  • ms1_tolerance_in_da: The mass tolerance to apply to the precursor m/z in Da, used only for identity search.

  • ms2_tolerance_in_da: The mass tolerance to apply to the fragment ions in Da.

  • method: The search method to employ. Available methods include identity, open, neutral_loss, and hybrid. You can use a string or a list/set of these four strings like {'identity', 'open'}. Use all to apply all four methods. The default value is all.

  • target: Where to perform the similarity calculation, on the cpu or gpu. The default value is cpu.

  • precursor_ions_removal_da: The mass tolerance for removing the precursor ions in Da. Fragment ions with m/z larger than precursor_mz - precursor_ions_removal_da will be removed. Based on our tests, removing precursor ions can enhance search performance.

  • noise_threshold: The intensity threshold for removing noise peaks. Peaks with intensity smaller than noise_threshold * max(fragment ion's intensity) will be removed.

  • max_peak_num: Keep only the max_peak_num peaks with the highest intensity.

To run the search function, use the following code:

entropy_similarity = entropy_search.search(
    precursor_mz = 150.0,
    peaks = [[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]]
)

The search function returns a dictionary, where the key is the search method and the value is a list of similarity scores. The scores align with the order of the spectral library returned by the build_index function. An example of the results is shown below:

{
    'identity_search': [0.0, 0.5, 0.0, 0.8],
    'open_search': [0.0, 0.0, 0.3, 0.8],
    'neutral_loss_search': [0.2, 0.0, 0.7, 0.0],
    'hybrid_search': [0.2, 0.5, 1.0, 0.8]
}
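Because every score list follows the order of the re-sorted library, the position of the highest score identifies the best-matching spectrum for each method. A small sketch using the illustrative numbers above (best_hit is a helper written for this example, not a library function):

```python
# Example result copied from the text above (illustrative scores).
results = {
    "identity_search": [0.0, 0.5, 0.0, 0.8],
    "open_search": [0.0, 0.0, 0.3, 0.8],
    "neutral_loss_search": [0.2, 0.0, 0.7, 0.0],
    "hybrid_search": [0.2, 0.5, 1.0, 0.8],
}

def best_hit(scores):
    """Return (library index, score) of the highest score."""
    index = max(range(len(scores)), key=lambda i: scores[i])
    return index, scores[index]

best = {method: best_hit(scores) for method, scores in results.items()}
```

The index in each tuple refers to the list returned by build_index, so use that list (not the original input order) to look up the matching spectrum.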

Alternative: individual search functions

Instead of using the search function that automatically (1) cleans the query spectrum and (2) performs the library search, you have the option to manually perform these tasks in two separate steps. You can use the clean_spectrum_for_search function to clean the query spectrum and then use individual search functions to search the library. Both approaches are equivalent, and you can choose the one that suits you best.

Clean the query spectrum

Before performing a library search, the query spectrum should be pre-processed using the clean_spectrum_for_search function. This function accomplishes the following:

  1. Remove empty peaks (m/z <= 0 or intensity <= 0).

  2. Remove peaks with m/z values greater than precursor_mz - precursor_ions_removal_da (removes precursor ions to improve the quality of spectral comparison).

  3. Centroid the spectrum by merging peaks within +/- min_ms2_difference_in_da and sort the resulting spectrum by m/z.

  4. Remove peaks with intensity less than noise_threshold * maximum intensity.

  5. Retain only the top max_peak_num peaks and remove all others.

  6. Normalize the intensity to sum to 1.
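The six steps above can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: clean_peaks_sketch is a hypothetical name, the default values are chosen only for the example, and the centroiding in step 3 is a greedy left-to-right merge that may differ from what clean_spectrum_for_search actually does.

```python
def clean_peaks_sketch(peaks, max_mz=None, min_ms2_difference_in_da=0.05,
                       noise_threshold=0.01, max_peak_num=None):
    """Simplified illustration of the six cleaning steps (hypothetical helper)."""
    # 1. Remove empty peaks.
    peaks = [(mz, i) for mz, i in peaks if mz > 0 and i > 0]
    # 2. Remove peaks above the cutoff (precursor_mz - precursor_ions_removal_da).
    if max_mz is not None:
        peaks = [(mz, i) for mz, i in peaks if mz <= max_mz]
    # 3. Sort by m/z and merge close neighbours (intensity-weighted m/z).
    peaks.sort()
    merged = []
    for mz, i in peaks:
        if merged and mz - merged[-1][0] <= min_ms2_difference_in_da:
            last_mz, last_i = merged[-1]
            total = last_i + i
            merged[-1] = ((last_mz * last_i + mz * i) / total, total)
        else:
            merged.append((mz, i))
    if not merged:
        return []
    # 4. Remove peaks below noise_threshold * maximum intensity.
    max_intensity = max(i for _, i in merged)
    merged = [(mz, i) for mz, i in merged if i >= noise_threshold * max_intensity]
    # 5. Keep only the max_peak_num most intense peaks, restoring m/z order.
    if max_peak_num is not None and len(merged) > max_peak_num:
        merged = sorted(sorted(merged, key=lambda p: p[1], reverse=True)[:max_peak_num])
    # 6. Normalize intensities to sum to 1.
    total = sum(i for _, i in merged)
    return [[mz, i / total] for mz, i in merged]
```

For example, with a precursor m/z of 150.0 and a removal window of 1.6 Da, the cutoff is 148.4, so a peak at 149.5 is dropped while peaks at 100.0 and 101.0 are kept.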

Assuming you have your query spectrum as:

query_spectrum = {"precursor_mz": 150.0,
                  "peaks": [[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]]}

To utilize the clean_spectrum_for_search function, call it on your query spectrum, passing in the relevant parameters:

query_spectrum['peaks'] = entropy_search.clean_spectrum_for_search(
    precursor_mz = query_spectrum['precursor_mz'],
    peaks = query_spectrum['peaks']
)

We also provide a separate function called clean_spectrum that performs the same cleaning steps as clean_spectrum_for_search. Here’s how to call this function:

from ms_entropy import clean_spectrum
precursor_ions_removal_da = 1.6
query_spectrum['peaks'] = clean_spectrum(
    peaks = query_spectrum['peaks'],
    max_mz = query_spectrum['precursor_mz'] - precursor_ions_removal_da
)

Both of these functions serve the same purpose and can be used interchangeably. You can select the one that suits your needs.

Performing library search using individual search functions

There are four individual search functions available for library searching:

  • identity_search for Identity search

  • open_search for Open search

  • neutral_loss_search for Neutral loss search

  • hybrid_search for Hybrid search

Each of these functions takes the pre-cleaned query spectrum as input and searches it against the spectral library index built in Step 1. It returns the similarity score for each spectrum in the library, in the same order as the spectral library returned by the build_index function.

Warning

Remember, when using any of the four individual search functions, the peaks must be pre-processed by either the clean_spectrum_for_search or the clean_spectrum function. Failure to do so will result in an error.
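If you are unsure whether a peak list has already been cleaned, a rough sanity check can help before calling the individual search functions. The helper below is hypothetical (looks_cleaned is not part of ms_entropy) and only tests properties implied by the cleaning steps described above: positive values, ascending m/z, and intensities summing to 1. Passing it does not guarantee that the library will accept the input.

```python
def looks_cleaned(peaks, tolerance=1e-4):
    """Hypothetical check: does this peak list look like cleaned output?"""
    peaks = [(float(mz), float(intensity)) for mz, intensity in peaks]
    if not peaks:
        return True  # nothing to check in an empty spectrum
    # No empty peaks.
    if any(mz <= 0 or intensity <= 0 for mz, intensity in peaks):
        return False
    # Strictly ascending m/z.
    mzs = [mz for mz, _ in peaks]
    if any(later <= earlier for earlier, later in zip(mzs, mzs[1:])):
        return False
    # Intensities normalized to sum to 1 (within a small tolerance).
    return abs(sum(intensity for _, intensity in peaks) - 1.0) <= tolerance
```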

The search functions accept the following parameters (as the example below shows, open_search is called without precursor_mz):

  • precursor_mz: The precursor m/z of the query spectrum.

  • peaks: The peaks of the query spectrum.

  • ms1_tolerance_in_da: The mass tolerance to use for the precursor m/z in Da, used only for identity search.

  • ms2_tolerance_in_da: The mass tolerance to use for the fragment ions in Da.

  • target: Specifies whether to run the similarity calculation on CPU or GPU. The default value is cpu.

Here’s an example of how you can use these functions:

# Identity search
entropy_similarity = entropy_search.identity_search(
    precursor_mz = query_spectrum['precursor_mz'],
    peaks = query_spectrum['peaks'],
    ms1_tolerance_in_da = 0.01,
    ms2_tolerance_in_da = 0.02
)

# Open search
entropy_similarity = entropy_search.open_search(
    peaks = query_spectrum['peaks'],
    ms2_tolerance_in_da = 0.02
)

# Neutral loss search
entropy_similarity = entropy_search.neutral_loss_search(
    precursor_mz = query_spectrum['precursor_mz'],
    peaks = query_spectrum['peaks'],
    ms2_tolerance_in_da = 0.02
)

# Hybrid search
entropy_similarity = entropy_search.hybrid_search(
    precursor_mz = query_spectrum['precursor_mz'],
    peaks = query_spectrum['peaks'],
    ms2_tolerance_in_da = 0.02
)