Entropy similarity

Clean each spectrum before calculating entropy similarity; in particular, we highly recommend removing noise ions. In our tests on the NIST20 and Massbank.us databases, removing ions with m/z values higher than the precursor ion’s m/z minus 1.6 considerably improved spectral identification performance.
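
The precursor-based filter described above can be sketched with plain NumPy. The helper name remove_ions_near_precursor, its offset parameter, and the example peak values are hypothetical and are not part of ms_entropy; this is only an illustration of the rule, not the package's own implementation.

```python
import numpy as np


def remove_ions_near_precursor(peaks, precursor_mz, offset=1.6):
    """Keep only peaks whose m/z is at or below precursor_mz - offset.

    Hypothetical helper for illustration; not an ms_entropy function.
    """
    peaks = np.asarray(peaks, dtype=np.float32).reshape(-1, 2)
    # Column 0 is m/z, column 1 is intensity.
    return peaks[peaks[:, 0] <= precursor_mz - offset]


# Made-up spectrum with a precursor ion at m/z 150.1.
peaks = np.array(
    [[69.071, 7.9], [86.066, 1.0], [149.5, 100.0], [150.1, 40.0]],
    dtype=np.float32,
)
filtered = remove_ions_near_precursor(peaks, precursor_mz=150.1)
print(filtered)  # only the peaks at or below m/z 148.5 remain
```

The filtered array can then be passed to the similarity functions in the usual way.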

We provide the calculate_entropy_similarity and calculate_unweighted_entropy_similarity functions to calculate the entropy similarity between two spectra. By default, these functions centroid the spectra and remove noise ions with intensities lower than 1% of the highest-intensity ion. If this behavior does not suit your needs, the parameters can be adjusted; see the References section for details.

Warning

The calculate_entropy_similarity and calculate_unweighted_entropy_similarity functions clean the spectra by default. If you prefer not to clean the spectra, please set clean_spectra to False. However, when you opt for this, ensure the spectra are pre-centroided, the minimum m/z difference between any two peaks in one spectrum is larger than 2 * ms2_tolerance_in_da, and in every spectrum, the sum of all peaks’ intensity is 1. Otherwise, the results may be incorrect.
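
As a rough aid, the conditions in this warning can be checked before passing clean_spectra = False. The function check_precleaned below is a hypothetical helper, not part of ms_entropy. Centroiding cannot be verified directly from a peak list, so the sketch only tests peak spacing and the intensity sum; the numerical tolerance atol is also an assumption.

```python
import numpy as np


def check_precleaned(peaks, ms2_tolerance_in_da=0.02, atol=1e-4):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper for illustration; not an ms_entropy function.
    """
    peaks = np.asarray(peaks, dtype=np.float64).reshape(-1, 2)
    problems = []
    if peaks.shape[0] == 0:
        return problems
    # Sort a copy of the m/z values so neighbouring peaks are adjacent.
    mz = np.sort(peaks[:, 0])
    if mz.size > 1 and np.min(np.diff(mz)) <= 2 * ms2_tolerance_in_da:
        problems.append("peaks closer than 2 * ms2_tolerance_in_da")
    if not np.isclose(peaks[:, 1].sum(), 1.0, atol=atol):
        problems.append("intensities do not sum to 1")
    return problems


ok_peaks = np.array([[69.07, 0.2], [86.1, 0.8]])
raw_peaks = np.array([[86.066, 1.0], [86.0969, 100.0]])
print(check_precleaned(ok_peaks, ms2_tolerance_in_da=0.05))
print(check_precleaned(raw_peaks, ms2_tolerance_in_da=0.05))
```

If the returned list is not empty, leaving clean_spectra at its default of True is the safer choice.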

import numpy as np
import ms_entropy as me

peaks_query = np.array([[69.071, 7.917962], [86.066, 1.021589], [86.0969, 100.0]], dtype = np.float32)
peaks_reference = np.array([[41.04, 37.16], [69.07, 66.83], [86.1, 999.0]], dtype = np.float32)

# Calculate unweighted entropy similarity.
unweighted_similarity = me.calculate_unweighted_entropy_similarity(peaks_query, peaks_reference, ms2_tolerance_in_da = 0.05)
print(f"Unweighted entropy similarity: {unweighted_similarity}.")

# Calculate entropy similarity.
similarity = me.calculate_entropy_similarity(peaks_query, peaks_reference, ms2_tolerance_in_da = 0.05)
print(f"Entropy similarity: {similarity}.")

References

ms_entropy.calculate_entropy_similarity(peaks_a, peaks_b, ms2_tolerance_in_da: float = 0.02, ms2_tolerance_in_ppm: float = -1, clean_spectra: bool = True, **kwargs)

Calculate the entropy similarity between two spectra.

First, the spectra are cleaned by the clean_spectrum() function.

Then, the entropy-based intensity weights are applied to the peaks.

Finally, the entropy similarity is calculated by the calculate_unweighted_entropy_similarity() function.

The formula for entropy similarity is as follows:

\[\begin{split}Similarity = \frac{1}{2} \sum_{i,j} \begin{cases} 0 & \text{ if } mz_{A,i} \neq mz_{B,j} \\ f(I_{A,i}+I_{B,j}) - f(I_{A,i}) - f(I_{B,j}) & \text{ if } mz_{A,i} = mz_{B,j} \end{cases}\end{split}\]

\[\text{ where } f(x) = x \log_2(x) \text{ and } \sum_{i} I_{A,i} = \sum_{j} I_{B,j} = 1\]

Parameters:

peaks_a : np.ndarray in shape (n_peaks, 2), np.float32 or list[list[float, float]]

The first spectrum to calculate entropy similarity for. The first column is m/z, and the second column is intensity.

peaks_b : np.ndarray in shape (n_peaks, 2), np.float32 or list[list[float, float]]

The second spectrum to calculate entropy similarity for. The first column is m/z, and the second column is intensity.

ms2_tolerance_in_da : float, optional

The MS2 tolerance in Da. Defaults to 0.02. If this is set to a negative value, ms2_tolerance_in_ppm will be used instead.

ms2_tolerance_in_ppm : float, optional

The MS2 tolerance in ppm. Defaults to -1. If this is set to a negative value, ms2_tolerance_in_da will be used instead.

Note: Either ms2_tolerance_in_da or ms2_tolerance_in_ppm must be positive. If both ms2_tolerance_in_da and ms2_tolerance_in_ppm are positive, ms2_tolerance_in_ppm will be used.

clean_spectra : bool, optional

Whether to clean the spectra before calculating entropy similarity. Defaults to True. Only set this to False if the spectra have been preprocessed by the clean_spectrum() function! Otherwise, the results will be incorrect. If the spectra are already cleaned, set this to False to save time.

**kwargs : optional

Additional keyword arguments to pass to the clean_spectrum() function.

Returns:

float: The entropy similarity between the two spectra.
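
The precedence between the two tolerance parameters described in the notes above can be written out as a small sketch. Both resolve_tolerance and ppm_to_da are hypothetical helpers, not ms_entropy functions. The ppm conversion uses the usual definition (m/z multiplied by ppm and by 1e-6); how the package applies a ppm tolerance to individual peaks is not stated here and is an assumption.

```python
def resolve_tolerance(ms2_tolerance_in_da=0.02, ms2_tolerance_in_ppm=-1):
    """Return ("ppm", value) or ("da", value) following the documented precedence.

    Hypothetical helper for illustration; not an ms_entropy function.
    """
    if ms2_tolerance_in_ppm > 0:
        # ppm wins whenever it is positive, even if the Da value is positive too.
        return ("ppm", ms2_tolerance_in_ppm)
    if ms2_tolerance_in_da > 0:
        return ("da", ms2_tolerance_in_da)
    raise ValueError(
        "Either ms2_tolerance_in_da or ms2_tolerance_in_ppm must be positive."
    )


def ppm_to_da(mz, ppm):
    """Convert a ppm tolerance to Da at a given m/z (standard ppm definition)."""
    return mz * ppm * 1e-6


print(resolve_tolerance())                           # defaults: Da is used
print(resolve_tolerance(ms2_tolerance_in_ppm=10))    # both positive: ppm is used
print(ppm_to_da(200.0, 10))                          # tolerance in Da at m/z 200
```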

ms_entropy.calculate_unweighted_entropy_similarity(peaks_a, peaks_b, ms2_tolerance_in_da: float = 0.02, ms2_tolerance_in_ppm: float = -1, clean_spectra: bool = True, **kwargs)

Calculate the unweighted entropy similarity between two spectra.

The formula for unweighted entropy similarity is as follows:

\[\begin{split}Similarity = \frac{1}{2} \sum_{i,j} \begin{cases} 0 & \text{ if } mz_{A,i} \neq mz_{B,j} \\ f(I_{A,i}+I_{B,j}) - f(I_{A,i}) - f(I_{B,j}) & \text{ if } mz_{A,i} = mz_{B,j} \end{cases}\end{split}\]

\[\text{ where } f(x) = x \log_2(x) \text{ and } \sum_{i} I_{A,i} = \sum_{j} I_{B,j} = 1\]
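
To make the formula concrete, here is a minimal pure-Python sketch of it. It assumes both peak lists are sorted by m/z with intensities that sum to 1, and it interprets "equal m/z" as "within a Da tolerance", matched one-to-one with a simple two-pointer walk. The function name and the matching strategy are assumptions for illustration; the package's own matching may differ.

```python
import math


def f(x):
    """f(x) = x * log2(x), with f(0) taken as 0."""
    return x * math.log2(x) if x > 0 else 0.0


def unweighted_entropy_similarity_sketch(peaks_a, peaks_b, tolerance_da=0.02):
    """Evaluate the formula on two lists of (mz, intensity) pairs.

    Hypothetical sketch; assumes sorted m/z and intensities summing to 1.
    """
    total = 0.0
    i = j = 0
    while i < len(peaks_a) and j < len(peaks_b):
        mz_a, int_a = peaks_a[i]
        mz_b, int_b = peaks_b[j]
        if abs(mz_a - mz_b) <= tolerance_da:
            # Matched pair: add f(I_A + I_B) - f(I_A) - f(I_B).
            total += f(int_a + int_b) - f(int_a) - f(int_b)
            i += 1
            j += 1
        elif mz_a < mz_b:
            i += 1  # unmatched peak in A contributes 0
        else:
            j += 1  # unmatched peak in B contributes 0
    return total / 2


spectrum = [(69.07, 0.25), (86.1, 0.75)]
print(unweighted_entropy_similarity_sketch(spectrum, spectrum))
print(unweighted_entropy_similarity_sketch(spectrum, [(50.0, 1.0)]))
```

For a matched pair with equal intensity x, the term reduces to f(2x) - 2f(x) = 2x, so identical spectra sum to 2 and the factor 1/2 gives a similarity of 1; spectra with no shared peaks give 0.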

Parameters:

peaks_a : np.ndarray in shape (n_peaks, 2), np.float32 or list[list[float, float]]

The first spectrum to calculate unweighted entropy similarity for. The first column is m/z, and the second column is intensity.

peaks_b : np.ndarray in shape (n_peaks, 2), np.float32 or list[list[float, float]]

The second spectrum to calculate unweighted entropy similarity for. The first column is m/z, and the second column is intensity.

ms2_tolerance_in_da : float, optional

The MS2 tolerance in Da. Defaults to 0.02. If this is set to a negative value, ms2_tolerance_in_ppm will be used instead.

ms2_tolerance_in_ppm : float, optional

The MS2 tolerance in ppm. Defaults to -1. If this is set to a negative value, ms2_tolerance_in_da will be used instead.

Note: Either ms2_tolerance_in_da or ms2_tolerance_in_ppm must be positive. If both ms2_tolerance_in_da and ms2_tolerance_in_ppm are positive, ms2_tolerance_in_ppm will be used.

clean_spectra : bool, optional

Whether to clean the spectra before calculating unweighted entropy similarity. Defaults to True. Only set this to False if the spectra have been preprocessed by the clean_spectrum() function! Otherwise, the results will be incorrect. If the spectra are already cleaned, set this to False to save time. If the spectra are in the list format, always set this to True or an error will be raised.

**kwargs : optional

Additional keyword arguments to pass to the clean_spectrum() function.

Returns:

float: The unweighted entropy similarity between the two spectra.

ms_entropy.apply_weight_to_intensity(peaks: ndarray) → ndarray

Apply a spectral-entropy-based weight to the intensities of a spectrum, following the method described in:

Li, Y., Kind, T., Folz, J. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat Methods 18, 1524-1531 (2021). https://doi.org/10.1038/s41592-021-01331-z.

Parameters:

peaks : np.ndarray in shape (n_peaks, 2), np.float32

The spectrum to apply weight to. The first column is m/z, and the second column is intensity. The peaks need to be pre-cleaned.

Returns:

np.ndarray in shape (n_peaks, 2), np.float32: The spectrum with weight applied. The first column is m/z, and the second column is intensity. The returned array is a copy of the input peaks.
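
For orientation, the weighting scheme from the cited paper can be sketched as follows: when the spectral entropy S (computed with the natural logarithm) is below 3, each intensity is raised to the power 0.25 + 0.25 * S and the intensities are renormalized to sum to 1. The function apply_weight_sketch is a hypothetical re-implementation for illustration; whether it matches apply_weight_to_intensity in every detail is an assumption.

```python
import numpy as np


def apply_weight_sketch(peaks):
    """Entropy-based intensity weighting, sketched after Li et al. (2021).

    Hypothetical helper for illustration; not the ms_entropy implementation.
    """
    peaks = np.asarray(peaks, dtype=np.float32).reshape(-1, 2)
    weighted = peaks.copy()  # never modify the caller's array
    if weighted.shape[0] == 0:
        return weighted
    # Normalize intensities so they form a probability distribution.
    p = weighted[:, 1] / weighted[:, 1].sum()
    nonzero = p[p > 0]
    entropy = float(-np.sum(nonzero * np.log(nonzero)))
    if entropy < 3:
        # Low-entropy spectra: compress intensities, then renormalize.
        p = np.power(p, 0.25 + 0.25 * entropy)
        p = p / p.sum()
    weighted[:, 1] = p
    return weighted


cleaned = np.array([[69.07, 0.9], [86.1, 0.1]], dtype=np.float32)
weighted = apply_weight_sketch(cleaned)
print(weighted)  # the minor peak gains relative intensity
```

In this two-peak example the entropy is well below 3, so the weighting raises the relative contribution of the weaker peak while leaving the m/z values untouched.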