API Reference

class ms_entropy.FlashEntropySearch(max_ms2_tolerance_in_da=0.024, mz_index_step=0.0001, low_memory=False, path_data=None, intensity_weight='entropy', **kwargs)[source]

Bases: object

Initialize the EntropySearch class.

Parameters:

max_ms2_tolerance_in_da – The maximum MS2 tolerance in Da.
mz_index_step – The step size for the m/z index.
low_memory – The memory usage mode, can be 0, 1, or 2. 0 (equivalent to the default, False) means normal mode, 1 means low memory mode, and 2 means medium memory mode.
path_data – The path to save the index data.
intensity_weight – The weight for the intensity in the entropy calculation, can be “entropy” or None. Default is “entropy”.
- None: The intensity will not be weighted, so the unweighted similarity will be calculated.
- “entropy”: The intensity will be weighted by the entropy, so the entropy similarity will be calculated.
kwargs – Those parameters will be ignored.
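
To make the intensity_weight option more concrete, the sketch below computes the spectral entropy of a peak list and applies an entropy-dependent intensity weighting. The exponent rule used here (0.25 + 0.25 × S for spectral entropy S < 3, otherwise no change) is an assumption taken from the published description of entropy similarity, not from this reference, and both helper names are hypothetical.

```python
import math

def spectral_entropy(intensities):
    """Shannon entropy (natural log) of intensities normalized to sum to 1."""
    total = sum(intensities)
    probs = [i / total for i in intensities if i > 0]
    return -sum(p * math.log(p) for p in probs)

def weight_intensities(intensities, intensity_weight="entropy"):
    """Return normalized intensities, optionally entropy-weighted (assumed rule)."""
    total = sum(intensities)
    normalized = [i / total for i in intensities]
    if intensity_weight is None:
        return normalized  # unweighted: intensities are only normalized
    entropy = spectral_entropy(normalized)
    if entropy >= 3:
        return normalized  # assumed rule: high-entropy spectra are left unchanged
    exponent = 0.25 + 0.25 * entropy  # assumption, see the lead-in
    powered = [i ** exponent for i in normalized]
    powered_total = sum(powered)
    return [i / powered_total for i in powered]

intensities = [80.0, 15.0, 5.0]
print(weight_intensities(intensities, intensity_weight=None))
print(weight_intensities(intensities, intensity_weight="entropy"))
```

Under this assumed rule, the weighting flattens spectra dominated by a single intense peak, so weaker fragments contribute more to the similarity score.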

build_index(all_spectra_list: list = None, max_indexed_mz: float = 1500.00005, precursor_ions_removal_da: float = 1.6, noise_threshold=0.01, min_ms2_difference_in_da: float = 0.05, max_peak_num: int = 0, clean_spectra: bool = True)[source]

Set the library spectra for entropy search.

The all_spectra_list must be a list of dictionaries, with each dictionary containing at least two keys: “precursor_mz” and “peaks”. The dictionary should be in the format of {“precursor_mz”: precursor_mz, “peaks”: peaks, …}. All keys in the dictionary, except “peaks”, will be saved as metadata and can be accessed using the __getitem__ function (e.g. entropy_search[0] returns the metadata for the first spectrum in the library).

The precursor_mz is the precursor m/z value of the MS/MS spectrum.

The peaks entry is a numpy array or a nested list holding the MS/MS spectrum, in the form [[mz1, intensity1], [mz2, intensity2], …].

Parameters:

all_spectra_list – A list of dictionaries in the format of {“precursor_mz”: precursor_mz, “peaks”: peaks}. The spectra in the list do not need to be sorted by the precursor m/z; this function will sort the spectra by the precursor m/z and output the sorted spectra list.
max_indexed_mz – The maximum m/z value that will be indexed. Default is 1500.00005.
precursor_ions_removal_da – The ions with m/z larger than precursor_mz - precursor_ions_removal_da will be removed. Default is 1.6. Set to None to not remove any ions.
noise_threshold – The intensity threshold for removing the noise peaks. The peaks with intensity smaller than noise_threshold * max(intensity) will be removed. Default is 0.01.
min_ms2_difference_in_da – The minimum difference between two peaks in the MS/MS spectrum. Default is 0.05.
max_peak_num – The maximum number of peaks in the MS/MS spectrum. Default is 0, which means no limit.
clean_spectra – If True, the spectra will be cleaned before indexing. Default is True. If ALL spectra in the library are pre-cleaned with the function clean_spectrum or clean_spectrum_for_search, set this parameter to False. ALWAYS set this parameter to True if the spectra are not preprocessed with the function clean_spectrum or clean_spectrum_for_search.

Returns:

If the all_spectra_list is provided, this function will return the sorted spectra list.
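
As an illustration of the expected input, the sketch below builds a small library in the documented dictionary format and shows the precursor-m/z ordering that build_index is described as producing. The spectra and the extra “name” key are made-up placeholders, and the sorting is done here in plain Python rather than by the library.

```python
# Hypothetical library: each entry needs "precursor_mz" and "peaks";
# any other key (here "name") is treated as metadata.
all_spectra_list = [
    {"precursor_mz": 250.0, "peaks": [[100.0, 1.0], [150.0, 0.5]], "name": "spectrum B"},
    {"precursor_mz": 150.0, "peaks": [[60.0, 1.0], [90.0, 0.3]], "name": "spectrum A"},
    {"precursor_mz": 300.0, "peaks": [[120.0, 0.7], [210.0, 1.0]], "name": "spectrum C"},
]

# Check the two required keys before handing the list to an index builder.
for spectrum in all_spectra_list:
    missing = {"precursor_mz", "peaks"} - spectrum.keys()
    if missing:
        raise ValueError(f"spectrum is missing required keys: {sorted(missing)}")

# Plain-Python equivalent of "sorted by the precursor m/z".
sorted_spectra = sorted(all_spectra_list, key=lambda s: s["precursor_mz"])

# Everything except "peaks" is what the reference calls metadata.
metadata = [{k: v for k, v in s.items() if k != "peaks"} for s in sorted_spectra]
print(metadata[0])
```

With the package installed, passing all_spectra_list to build_index should return an equivalently ordered list, and indexing the search object (for example entropy_search[0]) should return the metadata of the first spectrum in that order.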

clean_spectrum_for_search(precursor_mz, peaks, precursor_ions_removal_da: float = 1.6, noise_threshold=0.01, min_ms2_difference_in_da: float = 0.05, max_peak_num: int = 0)[source]

Clean the MS/MS spectrum. This function needs to be called before any search.

Parameters:

precursor_mz – The precursor m/z of the spectrum.
peaks – The peaks of the spectrum, should be a list or numpy array with shape (N, 2), where N is the number of peaks. The format of the peaks is [[mz1, intensity1], [mz2, intensity2], …].
precursor_ions_removal_da – The ions with m/z larger than precursor_mz - precursor_ions_removal_da will be removed. Default is 1.6. Set to None to not remove any ions.
noise_threshold – The intensity threshold for removing the noise peaks. The peaks with intensity smaller than noise_threshold * max(intensity) will be removed. Default is 0.01.
min_ms2_difference_in_da – The minimum difference between two peaks in the MS/MS spectrum. Default is 0.05.
max_peak_num – The maximum number of peaks in the MS/MS spectrum. Default is 0, which means no limit.
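
The precursor_ions_removal_da rule can be sketched in a few lines of plain Python. The helper name is hypothetical, and whether a peak exactly at the cutoff is kept is an assumption here (the description only says that peaks with larger m/z are removed).

```python
def remove_precursor_region(precursor_mz, peaks, precursor_ions_removal_da=1.6):
    """Drop peaks with m/z above precursor_mz - precursor_ions_removal_da."""
    if precursor_ions_removal_da is None:
        return [list(p) for p in peaks]  # None: do not remove any ions
    cutoff = precursor_mz - precursor_ions_removal_da
    return [[mz, intensity] for mz, intensity in peaks if mz <= cutoff]

peaks = [[50.0, 5.0], [120.0, 100.0], [199.0, 80.0]]

# With precursor_mz = 200.0 the cutoff is 198.4, so the 199.0 peak is dropped.
print(remove_precursor_region(200.0, peaks))
print(remove_precursor_region(200.0, peaks, precursor_ions_removal_da=None))
```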

get_topn_matches(similarity_array, topn=3, min_similarity=0.01)[source]

Get the topn MS/MS spectra with the highest entropy similarity.

Parameters:

similarity_array – The entropy similarity of the MS/MS spectra.
topn – The number of MS/MS spectra to return, if None, all the MS/MS spectra will be returned.
min_similarity – The minimum similarity of the MS/MS spectra to return, if None, all the MS/MS spectra will be returned.

Returns:

The topn MS/MS spectra with the highest entropy similarity.
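
The selection logic described above can be sketched as follows. This is not the library's implementation: the helper returns library indices rather than spectra, and treating min_similarity as an inclusive bound is an assumption.

```python
def topn_indices(similarity_array, topn=3, min_similarity=0.01):
    """Return library indices ordered by decreasing similarity."""
    order = sorted(
        range(len(similarity_array)),
        key=lambda i: similarity_array[i],
        reverse=True,
    )
    if min_similarity is not None:
        # Assumption: scores equal to min_similarity are kept.
        order = [i for i in order if similarity_array[i] >= min_similarity]
    if topn is not None:
        order = order[:topn]
    return order

similarity = [0.10, 0.85, 0.0, 0.40, 0.95]
best = topn_indices(similarity, topn=3, min_similarity=0.01)
everything = topn_indices(similarity, topn=None, min_similarity=None)
print(best, everything)
```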

hybrid_search(precursor_mz, peaks, ms2_tolerance_in_da, target='cpu', **kwargs)[source]

Run the hybrid search. The query spectrum should be preprocessed with the clean_spectrum() function before calling this function.

Parameters:

precursor_mz – The precursor m/z of the query spectrum.
peaks – The peaks of the query spectrum, should be the output of clean_spectrum() function.
ms2_tolerance_in_da – The MS2 tolerance in Da.
target – The target device for the search, can be “cpu” or “gpu”.

Returns:

The entropy similarity score for each spectrum in the library, a numpy array with shape (N,), where N is the number of spectra in the library.

identity_search(precursor_mz, peaks, ms1_tolerance_in_da, ms2_tolerance_in_da, target='cpu', output_matched_peak_number=False, **kwargs)[source]

Run the identity search. The query spectrum should be preprocessed with the clean_spectrum() function before calling this function.

For very large spectral libraries, running the identity search directly on the whole library is not recommended. Instead, divide the spectral library into several parts, build the index for each part, and run the identity search on each part, which will be much faster.

Parameters:

precursor_mz – The precursor m/z of the query spectrum.
peaks – The peaks of the query spectrum, should be the output of clean_spectrum() function.
ms1_tolerance_in_da – The MS1 tolerance in Da.
ms2_tolerance_in_da – The MS2 tolerance in Da.
target – The target device for the search, can be “cpu” or “gpu”.
output_matched_peak_number – If True, the number of matched peaks will be returned with the entropy similarity score.

Returns:

The entropy similarity score for each spectrum in the library, a numpy array with shape (N,), where N is the number of spectra in the library. If output_matched_peak_number is True, the return value is a tuple of two numpy arrays: the first is the entropy similarity score, and the second is the number of matched peaks.
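
Because the library is sorted by precursor m/z, an identity search only needs to consider the contiguous block of spectra whose precursor m/z lies within the MS1 tolerance. The sketch below shows one way to locate that block with a binary search. It is an illustration under that assumption, not the library's code, and the helper name is hypothetical.

```python
from bisect import bisect_left, bisect_right

def precursor_index_range(sorted_precursor_mzs, precursor_mz, ms1_tolerance_in_da):
    """Return the half-open range [start, stop) of entries within the MS1 tolerance."""
    start = bisect_left(sorted_precursor_mzs, precursor_mz - ms1_tolerance_in_da)
    stop = bisect_right(sorted_precursor_mzs, precursor_mz + ms1_tolerance_in_da)
    return start, stop

# Hypothetical precursor m/z values of a sorted library.
library_precursor_mzs = [150.0, 200.0, 200.005, 250.0]

start, stop = precursor_index_range(library_precursor_mzs, 200.0, 0.01)
print(start, stop)  # entries start..stop-1 are the identity-search candidates
```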

neutral_loss_search(precursor_mz, peaks, ms2_tolerance_in_da, target='cpu', output_matched_peak_number=False, **kwargs)[source]

Run the neutral loss search. The query spectrum should be preprocessed with the clean_spectrum() function before calling this function.

Parameters:

precursor_mz – The precursor m/z of the query spectrum.
peaks – The peaks of the query spectrum, should be the output of clean_spectrum() function.
ms2_tolerance_in_da – The MS2 tolerance in Da.
target – The target device for the search, can be “cpu” or “gpu”.
output_matched_peak_number – If True, the number of matched peaks will be returned with the entropy similarity score.

Returns:

The entropy similarity score for each spectrum in the library, a numpy array with shape (N,), where N is the number of spectra in the library. If output_matched_peak_number is True, the return value is a tuple of two numpy arrays: the first is the entropy similarity score, and the second is the number of matched peaks.
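
A neutral loss search compares the mass differences between the precursor and each fragment rather than the fragment m/z values themselves, which is why precursor_mz is required. The conversion can be sketched as below. The helper name is hypothetical, and the way the library stores or matches neutral losses internally is not covered by this reference.

```python
def to_neutral_losses(precursor_mz, peaks):
    """Convert [[mz, intensity], ...] into [[neutral_loss, intensity], ...], sorted by loss."""
    losses = [[precursor_mz - mz, intensity] for mz, intensity in peaks]
    return sorted(losses)

query_peaks = [[50.0, 0.2], [120.0, 0.8]]
neutral_losses = to_neutral_losses(200.0, query_peaks)
print(neutral_losses)
```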

open_search(peaks, ms2_tolerance_in_da, target='cpu', output_matched_peak_number=False, **kwargs)[source]

Run the open search. The query spectrum should be preprocessed with the clean_spectrum() function before calling this function.

Parameters:

peaks – The peaks of the query spectrum, should be the output of clean_spectrum() function.
ms2_tolerance_in_da – The MS2 tolerance in Da.
target – The target device for the search, can be “cpu” or “gpu”.
output_matched_peak_number – If True, the number of matched peaks will be returned with the entropy similarity score.

Returns:

The entropy similarity score for each spectrum in the library, a numpy array with shape (N,), where N is the number of spectra in the library. If output_matched_peak_number is True, the return value is a tuple of two numpy arrays: the first is the entropy similarity score, and the second is the number of matched peaks.

read(path_data=None)[source]

Read the MS/MS spectral library from a file.

Parameters:

path_data – The path of the file to read.

Returns:

None

save_memory_for_multiprocessing()[source]

Save the memory for multiprocessing. This function will move the numpy array in the index to shared memory in order to save memory.

This function is not required when you only use one thread to search the MS/MS spectra. When using multiple threads, this function is also not required but is highly recommended, as it avoids the memory copy and saves a lot of memory and time.

Returns:

None

search(precursor_mz, peaks, ms1_tolerance_in_da=0.01, ms2_tolerance_in_da=0.02, method='all', target='cpu', precursor_ions_removal_da: float = 1.6, noise_threshold=0.01, min_ms2_difference_in_da: float = 0.05, max_peak_num: int = None)[source]

Run the Flash entropy search for the query spectrum.

Parameters:

precursor_mz – The precursor m/z of the query spectrum.
peaks – The peaks of the query spectrum, should be a list or numpy array with shape (N, 2), where N is the number of peaks. The format of the peaks is [[mz1, intensity1], [mz2, intensity2], …].
ms1_tolerance_in_da – The MS1 tolerance in Da. Default is 0.01.
ms2_tolerance_in_da – The MS2 tolerance in Da. Default is 0.02.
method – The search method, can be “identity”, “open”, “neutral_loss”, “hybrid”, “all”, or list of the above.
target – The target device for the search, can be “cpu” or “gpu”.
precursor_ions_removal_da – The ions with m/z larger than precursor_mz - precursor_ions_removal_da will be removed. Default is 1.6.
noise_threshold – The intensity threshold for removing the noise peaks. The peaks with intensity smaller than noise_threshold * max(intensity) will be removed. Default is 0.01.
min_ms2_difference_in_da – The minimum difference between two peaks in the MS/MS spectrum. Default is 0.05.
max_peak_num – The maximum number of peaks in the MS/MS spectrum. Default is None, which means no limit.

Returns:

A dictionary with the search results. The keys are “identity_search”, “open_search”, “neutral_loss_search”, “hybrid_search”, and the values are the search results for each method.
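
A minimal end-to-end sketch of the documented workflow is given below, using only the signatures listed in this reference. The library and query spectra are made-up placeholders, and the import is guarded so the snippet also runs, without searching, where the package is not installed.

```python
# Hypothetical two-spectrum library and query; all values are placeholders.
library = [
    {"precursor_mz": 150.0, "peaks": [[60.0, 1.0], [90.0, 0.3]], "name": "spectrum A"},
    {"precursor_mz": 250.0, "peaks": [[100.0, 1.0], [150.0, 0.5]], "name": "spectrum B"},
]
query = {"precursor_mz": 150.0, "peaks": [[60.0, 1.0], [90.0, 0.4]]}

results = None
try:
    import numpy as np
    from ms_entropy import FlashEntropySearch
except ImportError:
    FlashEntropySearch = None  # package not installed; skip the search below

if FlashEntropySearch is not None:
    # Convert the nested lists to float32 arrays before indexing.
    for spectrum in library:
        spectrum["peaks"] = np.asarray(spectrum["peaks"], dtype=np.float32)

    entropy_search = FlashEntropySearch()
    entropy_search.build_index(library)  # cleans the spectra by default

    results = entropy_search.search(
        precursor_mz=query["precursor_mz"],
        peaks=np.asarray(query["peaks"], dtype=np.float32),
        ms1_tolerance_in_da=0.01,
        ms2_tolerance_in_da=0.02,
        method="all",
    )
    # One similarity array per search method, keyed as described above.
    for method_name, similarity in results.items():
        print(method_name, similarity)
```

Each value in the returned dictionary can then be passed to get_topn_matches to pick the best-scoring library spectra for that method.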

write(path_data=None)[source]

Write the MS/MS spectral library to a file.

Parameters:

path_data – The path of the file to write.

Returns:

None

class ms_entropy.FlashEntropySearchCore(path_data=None, max_ms2_tolerance_in_da=0.024, mz_index_step=0.0001, intensity_weight='entropy')[source]

Bases: object

Initialize the EntropySearch class.

Parameters:

path_data – The path of the index files.
max_ms2_tolerance_in_da – The maximum MS2 tolerance used when searching the MS/MS spectra, in Dalton. Default is 0.024.
mz_index_step – The step size of the m/z index, in Dalton. Default is 0.0001. The smaller the step size, the faster the search, but the larger the index size and longer the index building time.
intensity_weight – The weight of the intensity, can be “entropy” or None. If set to “entropy”, the intensity will be weighted by the entropy. If set to None, the intensity will not be weighted, which is equivalent to the unweighted entropy similarity.
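
The trade-off behind mz_index_step can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that the index keeps one slot per step between 0 and max_indexed_mz. That layout is a simplification and is not confirmed by this reference.

```python
import math

def index_slots(max_indexed_mz=1500.00005, mz_index_step=0.0001):
    """Number of m/z index slots under the one-slot-per-step assumption."""
    return math.floor(max_indexed_mz / mz_index_step)

# A ten times smaller step needs roughly ten times as many slots.
for step in (0.001, 0.0001):
    print(step, index_slots(mz_index_step=step))
```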

build_index(all_spectra_list: list, max_indexed_mz: float = 1500.00005, append: bool = False)[source]

Build the index for the MS/MS spectra library.

Each spectrum provided to this function should be a dictionary in the format of {“precursor_mz”: precursor_mz, “peaks”: peaks}. The precursor_mz is the precursor m/z value of the MS/MS spectrum. The peaks entry is a numpy array that has been processed by the function “clean_spectrum”.

Parameters:

all_spectra_list – A list of dictionaries in the format of {“precursor_mz”: precursor_mz, “peaks”: peaks}. The spectra in the list need to be sorted by the precursor m/z.
max_indexed_mz – The maximum m/z value that will be indexed. Default is 1500.00005.
append – Not implemented yet.

read(path_data=None)[source]

Read the index from the specified path.

save_memory_for_multiprocessing()[source]

Move the numpy array in the index to shared memory in order to save memory. This function is not required when you only use one thread to search the MS/MS spectra. When using multiple threads, this function is also not required but is highly recommended, as it avoids the memory copy and saves a lot of memory and time.

search(method='open', target='cpu', precursor_mz=None, peaks=None, ms2_tolerance_in_da=0.02, search_type=0, search_spectra_idx_min=0, search_spectra_idx_max=0, output_matched_peak_number=False)[source]

Perform identity-, open- or neutral loss search on the MS/MS spectra library.

Parameters:

method – The search method, can be “open” or “neutral_loss”. Set it to “open” for identity search and open search, set it to “neutral_loss” for neutral loss search.
target – The target to search, can be “cpu” or “gpu”.
precursor_mz – The precursor m/z of the query MS/MS spectrum, required for neutral loss search.
peaks – The peaks of the query MS/MS spectrum. The peaks need to be precleaned by “clean_spectrum” function.
ms2_tolerance_in_da – The MS2 tolerance used when searching the MS/MS spectra, in Dalton. Default is 0.02.
search_type – The search type, can be 0, 1 or 2. Set it to 0 for searching the whole MS/MS spectra library. Set it to 1 for searching a range of the MS/MS spectra library.
search_spectra_idx_min – The minimum index of the MS/MS spectra to search, required when search_type is 1.
search_spectra_idx_max – The maximum index of the MS/MS spectra to search, required when search_type is 1.
output_matched_peak_number – Whether to output the number of matched peaks. Only supported when target is “cpu”. If set to True, the function will return a tuple of (entropy_similarity, matched_peak_number).

search_hybrid(target='cpu', precursor_mz=None, peaks=None, ms2_tolerance_in_da=0.02)[source]

Perform the hybrid search for the MS/MS spectra.

Parameters:

target – The target to perform the search. “cpu” for CPU, “gpu” for GPU.
precursor_mz – The precursor m/z of the MS/MS spectra.
peaks – The peaks of the MS/MS spectra, which need to be cleaned with the “clean_spectrum” function.
ms2_tolerance_in_da – The MS/MS tolerance in Da.

write(path_data=None)[source]

Write the index to the specified path.

class ms_entropy.FlashEntropySearchCoreLowMemory(path_data, max_ms2_tolerance_in_da=0.024, mz_index_step=0.0001, intensity_weight='entropy')[source]

Bases: FlashEntropySearchCore

Initialize the EntropySearch class. This class uses the file.read function to read the data from the file, which is suitable when very low memory usage is required.

Parameters:

path_data – The path to save the index data.
max_ms2_tolerance_in_da – The maximum MS2 tolerance in Da.
mz_index_step – The step size for the m/z index.
intensity_weight – The weight for the intensity in the entropy calculation, can be “entropy” or None. Default is “entropy”.
- None: The intensity will not be weighted, so the unweighted similarity will be calculated.
- “entropy”: The intensity will be weighted by the entropy, so the entropy similarity will be calculated.

read(path_data=None)[source]

Read the index from the file.

search_hybrid(target='cpu', precursor_mz=None, peaks=None, ms2_tolerance_in_da=0.02)[source]

Perform the hybrid search for the MS/MS spectra.

Parameters:

target – The target to perform the search. “cpu” for CPU, “gpu” for GPU.
precursor_mz – The precursor m/z of the MS/MS spectra.
peaks – The peaks of the MS/MS spectra, which need to be cleaned with the “clean_spectrum” function.
ms2_tolerance_in_da – The MS/MS tolerance in Da.

write(path_data=None)[source]

Write the index to the file.

class ms_entropy.FlashEntropySearchCoreMediumMemory(path_data, max_ms2_tolerance_in_da=0.024, mz_index_step=0.0001, intensity_weight='entropy')[source]

Bases: FlashEntropySearchCore

Initialize the EntropySearch class. This class uses the memmap function to read data from the disk, which is suitable for most cases unless the data is extremely large.

Parameters:

path_data – The path to save the index data.
max_ms2_tolerance_in_da – The maximum MS2 tolerance in Da.
mz_index_step – The step size for the m/z index.
intensity_weight – The weight for the intensity in the entropy calculation, can be “entropy” or None. Default is “entropy”.
- None: The intensity will not be weighted, so the unweighted similarity will be calculated.
- “entropy”: The intensity will be weighted by the entropy, so the entropy similarity will be calculated.

read(path_data=None)[source]

Read the index from the file.

write(path_data=None)[source]

Write the index to the file.

ms_entropy.clean_spectrum(peaks, min_mz: float = -1.0, max_mz: float = -1.0, noise_threshold: float = 0.01, min_ms2_difference_in_da: float = 0.05, min_ms2_difference_in_ppm: float = -1.0, max_peak_num: int = -1, normalize_intensity: bool = True, **kwargs) → ndarray[source]

Clean, centroid, and normalize a spectrum with the following steps:

Remove empty peaks (m/z <= 0 or intensity <= 0).

Remove peaks with m/z >= max_mz or m/z <= min_mz.

Centroid the spectrum by merging peaks within min_ms2_difference_in_da.

Remove peaks with intensity < noise_threshold * max_intensity.

Keep only the top max_peak_num peaks.

Normalize the intensity to sum to 1.
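
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: the centroiding step is deliberately omitted, the function name is hypothetical, and treating non-positive min_mz, max_mz, and max_peak_num values as "skip this step" is an assumption based on the negative defaults in the signature.

```python
def clean_spectrum_sketch(peaks, min_mz=-1.0, max_mz=-1.0, noise_threshold=0.01,
                          max_peak_num=-1, normalize_intensity=True):
    """Simplified cleaning pipeline (no centroiding); returns (mz, intensity) tuples."""
    # Step 1: remove empty peaks (m/z <= 0 or intensity <= 0).
    cleaned = [(mz, i) for mz, i in peaks if mz > 0 and i > 0]
    # Step 2: remove peaks with m/z <= min_mz or m/z >= max_mz (skipped if not positive).
    if min_mz > 0:
        cleaned = [p for p in cleaned if p[0] > min_mz]
    if max_mz > 0:
        cleaned = [p for p in cleaned if p[0] < max_mz]
    # Step 3 (centroiding) is omitted in this sketch; only sort by m/z.
    cleaned.sort()
    # Step 4: remove peaks below noise_threshold * max_intensity.
    if cleaned and noise_threshold > 0:
        floor = noise_threshold * max(i for _, i in cleaned)
        cleaned = [p for p in cleaned if p[1] >= floor]
    # Step 5: keep only the max_peak_num most intense peaks, then restore m/z order.
    if max_peak_num > 0 and len(cleaned) > max_peak_num:
        strongest = sorted(cleaned, key=lambda p: p[1], reverse=True)[:max_peak_num]
        cleaned = sorted(strongest)
    # Step 6: normalize the intensities to sum to 1.
    if normalize_intensity and cleaned:
        total = sum(i for _, i in cleaned)
        cleaned = [(mz, i / total) for mz, i in cleaned]
    return cleaned

raw = [[0.0, 10.0], [300.0, 0.0], [100.0, 60.0], [50.0, 20.0], [150.0, 0.1], [200.0, 20.0]]
print(clean_spectrum_sketch(raw))
```

In this example the two empty peaks are dropped first, the 0.1-intensity peak falls below 1% of the base peak, and the remaining three peaks are normalized to sum to 1.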

Parameters:

peaks (np.ndarray in shape (n_peaks, 2), dtype=np.float32, or list[list[float, float]]) – A 2D array of shape (n_peaks, 2) where the first column is m/z and the second column is intensity.
min_mz (float, optional) – The minimum m/z to keep. Defaults to -1.0, which will skip removing peaks with m/z <= min_mz.
max_mz (float, optional) – The maximum m/z to keep. Defaults to -1.0, which will skip removing peaks with m/z >= max_mz.
noise_threshold (float, optional) – The minimum intensity to keep. Defaults to 0.01, which will remove peaks with intensity < 0.01 * max_intensity.
min_ms2_difference_in_da (float, optional) – The minimum m/z difference between two peaks in the resulting spectrum, in Da. Defaults to 0.05, which will merge peaks within 0.05 Da. If a negative value is given, min_ms2_difference_in_ppm will be used instead.
min_ms2_difference_in_ppm (float, optional) – The minimum m/z difference between two peaks in the resulting spectrum, in ppm. Defaults to -1, which will use min_ms2_difference_in_da instead. Note: either min_ms2_difference_in_da or min_ms2_difference_in_ppm must be positive. If both are positive, min_ms2_difference_in_ppm will be used.
max_peak_num (int, optional) – The maximum number of peaks to keep. Defaults to -1, which will keep all peaks.
normalize_intensity (bool, optional) – Whether to normalize the intensity to sum to 1. Defaults to True. If False, the intensity will be kept as is.
**kwargs (optional) – These keyword arguments will be ignored.

Returns:

The cleaned spectrum, as an np.ndarray in shape (n_peaks, 2) with dtype=np.float32. The cleaned spectrum is guaranteed to be sorted by m/z in ascending order.
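
The rule for choosing between min_ms2_difference_in_da and min_ms2_difference_in_ppm can be sketched as below. The ppm-to-Da conversion follows the definition of ppm. The helper name is hypothetical, and which m/z value the library uses as the reference for the ppm window is an assumption.

```python
def merge_window_in_da(mz, min_ms2_difference_in_da=0.05, min_ms2_difference_in_ppm=-1.0):
    """Return the effective merge window in Da at a given m/z."""
    if min_ms2_difference_in_ppm > 0:
        # ppm takes precedence when it is positive (also when both are positive).
        return mz * min_ms2_difference_in_ppm * 1e-6
    if min_ms2_difference_in_da > 0:
        return min_ms2_difference_in_da
    raise ValueError("either min_ms2_difference_in_da or min_ms2_difference_in_ppm must be positive")

print(merge_window_in_da(500.0))                                   # Da window from the default
print(merge_window_in_da(500.0, min_ms2_difference_in_ppm=20.0))   # 20 ppm at m/z 500
```

A ppm window scales with m/z, so it is narrower for light fragments and wider for heavy ones, whereas a Da window is constant across the spectrum.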