API Reference

class ms_entropy.DynamicEntropySearch(path_data, max_ms2_tolerance_in_da=0.024, extend_fold=3, mass_per_block: float = 0.05, num_per_group: int = 100000000, cache_list_threshold: int = 1000000, max_indexed_mz: float = 1500.00005, intensity_weight='entropy')

Bases: object

Initialize the DynamicEntropySearch object.

Parameters:
path_data : str or Path

Path to the directory where index files are stored.

max_ms2_tolerance_in_da : float, optional

Maximum MS/MS tolerance (in Daltons) used during spectrum search. Default is 0.024.

extend_fold : int, optional

Expansion factor for preallocated storage in each m/z block. Determines reserved_len = data_len * extend_fold. Default is 3.

mass_per_block : float, optional

m/z step size for creating the index blocks. Default is 0.05 Da.

num_per_group : int, optional

Number of spectra assigned to each group. Default is 100,000,000.

cache_list_threshold : int, optional

Number of spectra to accumulate in memory before writing them to disk. Default is 1,000,000.

max_indexed_mz : float, optional

Maximum m/z value to index. Ions above this threshold are grouped into a single block. Default is 1500.00005.

intensity_weight : {"entropy", None}, optional

Determines whether intensities are entropy-weighted. If "entropy", intensities are entropy-weighted before scoring. If None, intensities are left unweighted, which yields the unweighted entropy similarity. Default is "entropy".

Notes

If the index directory already contains group_start.pkl and metadata_start_loc.bin, they are loaded automatically. Otherwise, new metadata structures are initialized.

The underlying index and search engine is implemented in DynamicEntropySearchCore, which is initialized for the most recent group.
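As a rough illustration of how mass_per_block, max_indexed_mz, and extend_fold interact, the sketch below maps an m/z value to a block number and computes the reserved length of a block. The helper names block_index and reserved_len are hypothetical, and the floor-division mapping is an assumption about the layout, not the library's confirmed implementation.

```python
def block_index(mz, mass_per_block=0.05, max_indexed_mz=1500.00005):
    """Map an m/z value to a block number (hypothetical helper).

    Assumption: blocks are mass_per_block wide, and every ion above
    max_indexed_mz falls into the same final block.
    """
    capped = min(mz, max_indexed_mz)
    return int(capped // mass_per_block)


def reserved_len(data_len, extend_fold=3):
    """Reserved storage for a block, as stated in the extend_fold description."""
    return data_len * extend_fold


# Two nearby fragments land in different 0.05 Da blocks.
print(block_index(100.02), block_index(100.12))

# Ions above max_indexed_mz share one block.
print(block_index(1600.0) == block_index(2000.0))

# A block holding 10 entries reserves room for 30.
print(reserved_len(10))
```

A smaller mass_per_block produces more, narrower blocks; a larger extend_fold reserves more space per block so that later insertions are less likely to overflow it.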

add_new_spectra(spectra_list: list, insert_mode='fast_update', index_for_neutral_loss: bool = True, convert_to_flash: bool = True, clean=True, precursor_ions_removal_da: float = 1.6, noise_threshold: float = 0.01, min_ms2_difference_in_da: float = 0.05, max_peak_num: int = -1)

Add new spectra to the index.

This function serializes spectrum metadata, appends it to the metadata storage files, updates metadata offsets, and temporarily stores spectra in an in-memory cache. When the cache size exceeds cache_list_threshold, the function triggers incremental index building.

Parameters:
spectra_list : list

A list of spectra to add. All spectra must share the same ion mode and should already be in the expected input format.

insert_mode : {"fast_update", "fast_search"}, optional

Insertion mode used when updating the index.

  • "fast_update" (default): Builds index structures without re-sorting.

  • "fast_search": Builds index structures with re-sorting.

index_for_neutral_loss : bool, optional

If True (default), the index will also maintain entries for neutral-loss ions when building new index blocks.

convert_to_flash : bool, optional

Whether to convert the index into the compact Flash Entropy Search format once the group is full. Default is True.

clean : bool, optional

Whether to clean the spectra before adding. Default is True.

precursor_ions_removal_da : float, optional

Peaks with m/z greater than precursor_mz - precursor_ions_removal_da are removed during cleaning. Default is 1.6 Da.

noise_threshold : float, optional

Relative intensity threshold for noise filtering during cleaning. Peaks with intensity < noise_threshold * max(intensity) are removed. Default is 0.01.

min_ms2_difference_in_da : float, optional

Minimum spacing allowed between MS/MS peaks during cleaning. Default is 0.05 Da.

max_peak_num : int or None, optional

Maximum number of peaks to keep after cleaning. None keeps all peaks. The signature default is -1.

Returns:
None

Notes

Spectra are held in cache_list until the cache reaches the size specified by cache_list_threshold. At that point, build_index() is invoked to integrate the cached spectra into the persistent index.
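The cache-then-flush behaviour described above can be sketched with a toy class. CachedIndexer is a hypothetical stand-in, not the library class; the in-memory list named indexed stands in for the on-disk index, and flushing when the cache size is greater than or equal to the threshold is an assumption about the exact comparison.

```python
class CachedIndexer:
    """Toy model of the cache/flush pattern (not the library implementation)."""

    def __init__(self, cache_list_threshold):
        self.cache_list_threshold = cache_list_threshold
        self.cache_list = []   # spectra waiting to be indexed
        self.indexed = []      # stands in for the persistent index
        self.flush_count = 0

    def add_new_spectra(self, spectra_list):
        self.cache_list.extend(spectra_list)
        # Assumption: flush as soon as the cache reaches the threshold.
        if len(self.cache_list) >= self.cache_list_threshold:
            self.build_index()

    def build_index(self):
        # Move every cached spectrum into the index and empty the cache.
        self.indexed.extend(self.cache_list)
        self.cache_list = []
        self.flush_count += 1


indexer = CachedIndexer(cache_list_threshold=3)
indexer.add_new_spectra(["s1", "s2"])   # below threshold: stays cached
print(len(indexer.cache_list), len(indexer.indexed))
indexer.add_new_spectra(["s3", "s4"])   # threshold reached: flushed
print(len(indexer.cache_list), len(indexer.indexed))
```

In this model, spectra left in the cache after the last add_new_spectra() call are only integrated once build_index() is called explicitly.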

build_index(insert_mode='fast_update', index_for_neutral_loss: bool = True, convert_to_flash: bool = True)

Build or update the spectral index.

This method processes the spectra stored in cache_list and integrates them into the on-disk index. Depending on the current number of indexed spectra, the method may insert spectra into the existing index, build the index from scratch, or create a new index group. It is also invoked internally by add_new_spectra().

Parameters:
insert_mode : {"fast_update", "fast_search"}, optional

The index update strategy.

  • "fast_update" (default): Incrementally appends spectra to the existing index without re-sorting blocks.

  • "fast_search": Re-sorts index blocks for optimized search performance.

index_for_neutral_loss : bool, optional

If True (default), neutral-loss ions are also indexed when building new blocks.

convert_to_flash : bool, optional

When a new index group is created, determines whether the existing index is converted to the compact Flash Entropy Search format before writing. Default is True.

Returns:
None

convert_current_index_to_flash()

Convert the current dynamic index into the Flash Entropy Search format.

This method extracts the peak data from the currently active dynamic index and rebuilds it as a Flash-format index using DynamicWithFlash. After conversion, the original dynamic index files are removed and replaced with the Flash-formatted index.

This method is intended to be used after calling convert_to_fast_search(), which ensures that all groups except the active one have already been converted.

Returns:
None

Convert all dynamic index groups to the Flash Entropy Search format.

This method iterates through all index groups in the library and converts any group stored in dynamic indexing format (identified by the presence of information_dynamic.json) into the compact and search-optimized Flash Entropy Search format.

This operation should be performed after calling write(), ensuring that all index data is fully written before conversion.

Returns:
None

get_metadata(group_idx, spec_idx)

Retrieve the metadata for a spectrum specified by its group and within-group index.

This method computes the global spectrum index by combining the group-level offset stored in group_start with the spectrum’s position inside that group. It then returns the metadata for the corresponding spectrum.

Parameters:
group_idx : int

The index of the spectrum group. group_start[group_idx] gives the global index of the first spectrum in this group.

For example:

  • Group 0 always begins at global index 0.

  • If group 0 contains 1,000,000 spectra, group 1 begins at global index 1,000,000; group 2 begins at the cumulative count of groups 0 and 1, and so on.

spec_idx : int

The index of the spectrum within the specified group. The global spectrum index is computed as:

global_spec_idx = group_start[group_idx] + spec_idx

Returns:
dict

The metadata dictionary for the requested spectrum.

Notes

This method internally uses __getitem__() to retrieve the spectrum metadata once the global index has been determined.
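The index arithmetic above can be sketched in a few lines. The function global_spec_idx mirrors the documented formula; locate is a hypothetical inverse helper (not part of the documented API) that recovers the group and within-group index from a global index, assuming group_start is sorted in ascending order.

```python
from bisect import bisect_right


def global_spec_idx(group_start, group_idx, spec_idx):
    """Documented formula: group offset plus within-group position."""
    return group_start[group_idx] + spec_idx


def locate(group_start, global_idx):
    """Hypothetical inverse: global index -> (group_idx, spec_idx)."""
    group_idx = bisect_right(group_start, global_idx) - 1
    return group_idx, global_idx - group_start[group_idx]


# Example layout: group 0 holds 1,000,000 spectra, group 1 holds 1,500,000.
group_start = [0, 1_000_000, 2_500_000]

print(global_spec_idx(group_start, 1, 42))   # 1000042
print(locate(group_start, 1_000_042))        # (1, 42)
```

The inverse mapping is convenient when a search returns global indices, as in the (global_spec_idx, similarity_score) tuples of search_topn_matches(), and the group is needed afterwards.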

Perform a hybrid search across the spectral library.

Hybrid search combines both open search and neutral-loss search to evaluate the similarity between the query spectrum and all spectra in the library.

Parameters:
precursor_mz : float

The precursor m/z of the query spectrum.

peaks : array-like

The cleaned peaks of the query spectrum. The spectrum should be preprocessed before performing the search.

ms2_tolerance_in_da : float

Fragment-ion (MS2) tolerance in Daltons.

Returns:
numpy.ndarray

A 1D array of entropy-based hybrid similarity scores with shape (N,), where N is the total number of spectra in the library.

Perform an identity search across all indexed spectra.

This method searches the entire spectral library under self.path_data for spectra whose precursor m/z values fall within the specified MS1 tolerance of the query precursor. For each matching spectrum, the entropy similarity score is computed using the underlying DynamicEntropySearchCore or DynamicWithFlash search engine.

Parameters:
precursor_mz : float

The precursor m/z value of the query spectrum.

peaks : array-like

A list or array of peaks from the query spectrum. Peaks must be preprocessed (cleaned, centroided, filtered, etc.) before calling this method.

ms1_tolerance_in_da : float

MS1 (precursor m/z) tolerance in Daltons.

ms2_tolerance_in_da : float

MS2 (fragment ion) tolerance in Daltons.

Returns:
numpy.ndarray

A 1D array of entropy similarity scores of length N, where N is the total number of spectra in the library. Spectra whose precursor m/z does not match within tolerance receive a score of 0.0.
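The precursor filter can be pictured as a mask applied to the score array. The sketch below uses plain lists and two hypothetical helpers, identity_mask and apply_identity_filter; treating the tolerance as inclusive is an assumption.

```python
def identity_mask(library_precursor_mz, query_precursor_mz, ms1_tolerance_in_da):
    """True where a library precursor lies within the MS1 tolerance.

    Assumption: the tolerance bound is inclusive.
    """
    return [
        abs(mz - query_precursor_mz) <= ms1_tolerance_in_da
        for mz in library_precursor_mz
    ]


def apply_identity_filter(scores, mask):
    """Keep scores of matching precursors; everything else becomes 0.0."""
    return [score if keep else 0.0 for score, keep in zip(scores, mask)]


library_precursor_mz = [200.005, 200.5, 199.995, 350.0]
raw_scores = [0.9, 0.8, 0.4, 0.7]

mask = identity_mask(library_precursor_mz, 200.0, ms1_tolerance_in_da=0.01)
print(apply_identity_filter(raw_scores, mask))   # [0.9, 0.0, 0.4, 0.0]
```

The output keeps one entry per library spectrum, so positions in the result still line up with global spectrum indices.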

Perform a neutral-loss search across the spectral library.

This method computes entropy similarity scores based on neutral-loss matching, where each peak is compared by its neutral-loss mass (precursor m/z minus fragment m/z).

Parameters:
precursor_mz : float

The precursor m/z value of the query spectrum.

peaks : array-like

The cleaned peaks of the query spectrum. Peaks should be preprocessed before calling this method.

ms2_tolerance_in_da : float

Fragment-ion (MS2) tolerance in Daltons.

Returns:
numpy.ndarray

A 1D array of entropy similarity scores with shape (N,), where N is the total number of spectra in the library.
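To make the neutral-loss idea concrete, the sketch below converts fragment peaks into neutral-loss masses using precursor_mz - peak_mz, the transform named in the search_hybrid description. The helper to_neutral_loss is hypothetical and only illustrates the transform, not the scoring.

```python
def to_neutral_loss(precursor_mz, peaks):
    """Return (neutral_loss_mass, intensity) pairs sorted by neutral-loss mass."""
    return sorted((precursor_mz - mz, intensity) for mz, intensity in peaks)


# Two spectra with different precursors and different fragment m/z values...
spectrum_a = to_neutral_loss(300.0, [(100.0, 0.2), (250.0, 0.8)])
spectrum_b = to_neutral_loss(320.0, [(270.0, 1.0)])

# ...share a neutral loss of 50.0 Da, which fragment-based matching would miss.
print(spectrum_a)   # [(50.0, 0.8), (200.0, 0.2)]
print(spectrum_b)   # [(50.0, 1.0)]
```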

Perform an open search across the entire spectral library.

This method computes entropy similarity scores between the query spectrum and all spectra stored in the library, using the open search strategy of DynamicEntropySearchCore or DynamicWithFlash.

Parameters:
peaks : array-like

The peaks of the query spectrum. The spectrum must be preprocessed (e.g., centroided, filtered) before search.

ms2_tolerance_in_da : float

Fragment-ion (MS2) tolerance in Daltons.

Returns:
numpy.ndarray

A 1D array of entropy similarity scores with shape (N,), where N is the total number of spectra in the library.

read()

Load index metadata from disk.

This method reads previously saved group boundary information and metadata start offsets from the index directory.

Returns:
None

Notes

The following components are loaded:

  • group_start from group_start.pkl

Contains the cumulative spectrum counts marking the start of each index group.

  • metadata_start_loc from metadata_start_loc.bin

A memory-mapped array of byte offsets indicating the starting position of each serialized spectrum metadata entry.
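The role of the byte offsets can be illustrated with a small in-memory model. Everything here is an assumption made for illustration: the helpers append_metadata and read_metadata are hypothetical, JSON is used as the serialization, and a bytearray stands in for the metadata file. The actual on-disk format is not described in this reference.

```python
import json


def append_metadata(buffer, start_loc, metadata):
    """Record where the entry starts, then append its serialized bytes."""
    start_loc.append(len(buffer))
    buffer.extend(json.dumps(metadata).encode("utf-8"))


def read_metadata(buffer, start_loc, idx):
    """Slice one entry out of the buffer using its start offset.

    Assumption: an entry ends where the next one starts (or at the buffer end).
    """
    begin = start_loc[idx]
    end = start_loc[idx + 1] if idx + 1 < len(start_loc) else len(buffer)
    return json.loads(bytes(buffer[begin:end]).decode("utf-8"))


buffer, start_loc = bytearray(), []
append_metadata(buffer, start_loc, {"name": "spectrum A", "precursor_mz": 200.0})
append_metadata(buffer, start_loc, {"name": "spectrum B", "precursor_mz": 350.0})

print(start_loc[0])                                  # 0
print(read_metadata(buffer, start_loc, 1)["name"])   # spectrum B
```

Storing offsets separately means a single entry can be fetched without deserializing the entries before it.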

search(precursor_mz, peaks, ms1_tolerance_in_da=0.01, ms2_tolerance_in_da=0.02, method='all', precursor_ions_removal_da: float = 1.6, clean=True, noise_threshold=0.01, min_ms2_difference_in_da=0.05, max_peak_num=None)

Perform spectral search on a query spectrum.

This method runs one or more search strategies (identity, open, neutral-loss, and hybrid) across the indexed spectral library. The results are returned as a dictionary keyed by search method.

Parameters:
precursor_mz : float

The precursor m/z value of the query spectrum.

peaks : array-like

The MS/MS peaks of the query spectrum, with shape (N, 2) where each row is [mz, intensity]. Peaks must follow this format and may optionally be cleaned.

ms1_tolerance_in_da : float, optional

Tolerance for precursor m/z matching in identity search. Default is 0.01 Da.

ms2_tolerance_in_da : float, optional

Fragment-ion tolerance used by all search modes. Default is 0.02 Da.

method : {"identity", "open", "neutral_loss", "hybrid", "all"} or list, optional

Specifies the search mode(s) to execute.

  • "identity" - identity search

  • "open" - open search

  • "neutral_loss" - neutral-loss search

  • "hybrid" - combined open + neutral-loss search

  • "all" (default) - run all four modes

precursor_ions_removal_da : float, optional

Peaks with m/z greater than precursor_mz - precursor_ions_removal_da are removed during spectrum cleaning. Default is 1.6 Da.

clean : bool, optional

Whether to clean the query spectrum before searching. Default is True.

noise_threshold : float, optional

Peaks with intensities below noise_threshold * max(intensity) are removed during cleaning. Default is 0.01.

min_ms2_difference_in_da : float, optional

Minimum allowed spacing between MS/MS peaks after cleaning. Default is 0.05 Da.

max_peak_num : int or None, optional

Maximum number of peaks to retain after cleaning. None (default) keeps all peaks.

Returns:
dict

A dictionary mapping each selected search method to its corresponding entropy similarity score array.

Each value is a NumPy array with length equal to the total number of spectra in the library.
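A result dictionary of this shape can be post-processed per method. The sketch below uses plain lists in place of NumPy arrays and a hypothetical helper, best_match_per_method, to pick the highest-scoring library index for each search mode; the example scores are made up for illustration.

```python
def best_match_per_method(results):
    """Return {method: (best_index, best_score)} for a dict of score arrays."""
    best = {}
    for method, scores in results.items():
        scores = list(scores)  # also accepts array-like inputs
        idx = max(range(len(scores)), key=scores.__getitem__)
        best[method] = (idx, scores[idx])
    return best


# Illustrative scores for a three-spectrum library.
results = {
    "identity": [0.0, 0.92, 0.0],
    "open": [0.31, 0.92, 0.55],
    "hybrid": [0.31, 0.60, 0.95],
}

for method, (idx, score) in best_match_per_method(results).items():
    print(method, idx, score)
```

Because every array covers the whole library, the same index refers to the same spectrum in each mode, so scores can be compared across modes position by position.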

Notes

  • Cleaning is performed by clean_spectrum() if clean=True.

  • When method="all", all four search strategies are executed.

  • Only the relevant search modes are invoked based on the method argument.
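The cleaning parameters documented above can be read as a small pipeline. The following is a simplified sketch, not the clean_spectrum() implementation: the helper name clean_peaks_sketch is hypothetical, the order of the steps is an assumption, and the min_ms2_difference_in_da and max_peak_num steps are deliberately left out.

```python
def clean_peaks_sketch(peaks, precursor_mz=None,
                       precursor_ions_removal_da=1.6, noise_threshold=0.01):
    """Simplified cleaning: precursor removal, noise filter, sort, normalize."""
    peaks = [(float(mz), float(intensity)) for mz, intensity in peaks]

    # Remove peaks with m/z greater than precursor_mz - precursor_ions_removal_da.
    if precursor_mz is not None:
        cutoff = precursor_mz - precursor_ions_removal_da
        peaks = [(mz, i) for mz, i in peaks if mz <= cutoff]
    if not peaks:
        return []

    # Remove peaks with intensity < noise_threshold * max(intensity).
    # Assumption: the maximum is taken after precursor removal.
    max_intensity = max(i for _, i in peaks)
    peaks = [(mz, i) for mz, i in peaks if i >= noise_threshold * max_intensity]

    # Sort by m/z and normalize so the intensities sum to 1.
    peaks.sort()
    total = sum(i for _, i in peaks)
    return [(mz, i / total) for mz, i in peaks]


raw = [(150.0, 300.0), (50.0, 100.0), (120.0, 0.5), (199.5, 1000.0)]
print(clean_peaks_sketch(raw, precursor_mz=200.0))
```

With precursor_mz=200.0 the cutoff is 198.4, so the peak at 199.5 is dropped; the peak at 120.0 then falls below 1% of the remaining maximum and is dropped as noise.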

search_topn_matches(peaks, precursor_mz=None, ms1_tolerance_in_da=0.01, ms2_tolerance_in_da=0.02, method='open', clean=True, precursor_ions_removal_da: float = 1.6, noise_threshold=0.01, min_ms2_difference_in_da=0.05, max_peak_num=None, topn: int = 3, need_metadata: bool = True)

Search the spectral library and return the top-N most similar spectra.

This method performs one selected search mode (identity, open, neutral-loss, or hybrid) and returns the best matching spectra according to the similarity scores. Optionally, the query spectrum can be cleaned prior to searching, and metadata for the matched spectra can be returned.

Parameters:
peaks : array-like

The MS/MS peaks of the query spectrum, with shape (N, 2) in the format [[mz1, intensity1], [mz2, intensity2], ...].

precursor_mz : float, optional

The precursor m/z of the query spectrum. Required for "identity", "neutral_loss", and "hybrid" modes.

ms1_tolerance_in_da : float, optional

MS1 tolerance in Daltons used in identity filtering. Default is 0.01 Da.

ms2_tolerance_in_da : float, optional

MS2 fragment tolerance in Daltons for similarity computation. Default is 0.02 Da.

method : {"identity", "open", "neutral_loss", "hybrid"}, optional

The search mode to perform. Default is "open".

clean : bool, optional

Whether to clean the query spectrum before searching. Default is True.

precursor_ions_removal_da : float, optional

Peaks with m/z greater than precursor_mz - precursor_ions_removal_da are removed during cleaning. Default is 1.6 Da.

noise_threshold : float, optional

Relative intensity threshold for noise filtering during cleaning. Peaks with intensity < noise_threshold * max(intensity) are removed. Default is 0.01.

min_ms2_difference_in_da : float, optional

Minimum spacing allowed between MS/MS peaks during cleaning. Default is 0.05 Da.

max_peak_num : int or None, optional

Maximum number of peaks to keep after cleaning. None keeps all peaks. Default is None.

topn : int, optional

Number of top-matching spectra to return. If None, all spectra are returned. Default is 3.

need_metadata : bool, optional

If True (default), return the metadata dictionary for each matched spectrum. If False, return (global_index, similarity) tuples instead.

Returns:
list or list of tuples
If need_metadata=True:

A list of metadata dictionaries, each containing the search result and the similarity score under the key.

If need_metadata=False:

A list of tuples (global_spec_idx, similarity_score) with a one-to-one correspondence between indices and scores.

write()

Write index metadata and group information to disk.

Returns:
None

Notes

The following components are written:

  • Internal index data via DynamicEntropySearchCore.write().

  • The list group_start is serialized to group_start.pkl to record the cumulative spectrum count at the start of each index group.

class ms_entropy.DynamicEntropySearchCore(path_data, max_ms2_tolerance_in_da=0.024, extend_fold=3, mass_per_block: float = 0.05, max_indexed_mz: float = 1500.00005, intensity_weight='entropy')

Bases: object

Initialize the DynamicEntropySearchCore object.

Parameters:
path_data : str or Path

Path to the directory where index data for this group will be stored.

max_ms2_tolerance_in_da : float, optional

Maximum MS2 fragment-ion tolerance (in Daltons) allowed during entropy search. Default is 0.024.

extend_fold : int, optional

Expansion factor applied when allocating space for block data. Must be greater than 1. Default is 3.

mass_per_block : float, optional

The m/z interval used to define ion index blocks. Default is 0.05 Da.

max_indexed_mz : float, optional

Maximum fragment-ion m/z to index. Fragment ions with larger m/z values are placed into a single block. Default is 1500.00005.

intensity_weight : {"entropy", None}, optional

Whether fragment intensities should be entropy-weighted. If "entropy", intensities are transformed using entropy weighting. If None, raw intensity values are used. Default is "entropy".

Returns:
None

add_new_spectrum_into_index(add_spectrum_list: list)

Add new spectra into the index with full sorting.

This method maintains sorted order within each block by merging existing block data with the new spectra and re-sorting the combined block.

Parameters:
add_spectrum_list : list of dict

A list of spectra to be added.

Returns:
None

The method updates the on-disk index and in-memory block metadata in-place.
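The merge-and-re-sort step can be sketched on a single block. In this illustration a block is a list of (mz, intensity, spec_idx) tuples kept sorted by m/z; the helper merge_into_sorted_block and the tuple layout are assumptions made for the example, not the library's storage format.

```python
from heapq import merge


def merge_into_sorted_block(block, new_peaks):
    """Merge new peaks into a block that is already sorted by m/z.

    Only the new peaks need sorting; heapq.merge then combines the two
    sorted sequences in a single pass.
    """
    return list(merge(block, sorted(new_peaks)))


block = [(100.01, 0.5, 0), (100.03, 0.2, 1)]
new_peaks = [(100.04, 0.1, 2), (100.02, 0.7, 2)]

merged = merge_into_sorted_block(block, new_peaks)
print([mz for mz, _, _ in merged])   # [100.01, 100.02, 100.03, 100.04]
```

Keeping each block sorted costs extra work on every insertion, which is the trade-off against the append-only behaviour of fast_add_new_spectrum_into_index().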

build_index(all_spectra_list: list, index_for_neutral_loss: bool = True)

Build the fragment-ion index (and optional neutral-loss index) for the MS/MS spectral library.

The input spectra must be dictionaries with the structure:

{
    "precursor_mz": float,
    "peaks": numpy.ndarray
}

where:

  • precursor_mz is the precursor m/z value of the MS/MS spectrum

  • peaks is a 2-column numpy array that has been preprocessed using clean_spectrum(), containing sorted and normalized (m/z, intensity) pairs

Parameters:
all_spectra_list : list of dict

A list of spectrum dictionaries in the format described above.

index_for_neutral_loss : bool, optional

Whether to also build the neutral-loss index. If True, the method generates both the product-ion and neutral-loss indices. If False, only the product-ion index is built. Default is True.

Returns:
None

The method updates internal index structures in-place.

Convert the current index into fast search mode.

In fast search mode, all peaks stored in the index are sorted by their corresponding mass key:

  • Product-ion index is sorted by fragment m/z.

  • Neutral-loss index (if present) is sorted by neutral-loss mass.

Parameters:
None

Returns:
None

The method updates the on-disk index and in-memory block metadata in-place.
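One plausible benefit of a sorted index is that the peaks within a tolerance window can be located by binary search instead of a full scan. The sketch below shows that lookup on a sorted list of m/z values; the helper peaks_within_tolerance is hypothetical, and whether the library performs the lookup this way is an assumption.

```python
from bisect import bisect_left, bisect_right


def peaks_within_tolerance(sorted_mz, query_mz, ms2_tolerance_in_da):
    """Return the index range of entries within query_mz +/- tolerance."""
    lo = bisect_left(sorted_mz, query_mz - ms2_tolerance_in_da)
    hi = bisect_right(sorted_mz, query_mz + ms2_tolerance_in_da)
    return range(lo, hi)


sorted_mz = [99.9, 100.015, 100.03, 100.1, 100.2]
hits = peaks_within_tolerance(sorted_mz, query_mz=100.02, ms2_tolerance_in_da=0.02)
print([sorted_mz[i] for i in hits])   # [100.015, 100.03]
```

The same lookup applies to a neutral-loss index sorted by neutral-loss mass.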

fast_add_new_spectrum_into_index(add_spectrum_list: list)

Add new spectra into the existing index in fast-update mode (without resorting).

This method appends new spectra to the current on-disk index structure by inserting their fragment ions (and, if a neutral-loss index exists, their neutral-loss masses) into existing blocks. Blocks whose reserved capacity is exceeded are moved and expanded, but no global re-sorting of the full index is performed.

Parameters:
add_spectrum_list : list of dict

A list of spectra to be added.

Returns:
None

The method updates the on-disk index and in-memory block information in-place.
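The append-with-reserved-capacity behaviour can be modelled with a small class. ReservedBlock is a hypothetical illustration: a Python list stands in for the on-disk block, and growing to (required length) * extend_fold on overflow is an assumed policy, not the library's confirmed one.

```python
class ReservedBlock:
    """Toy block with preallocated space (not the library implementation)."""

    def __init__(self, data, extend_fold=3):
        self.extend_fold = extend_fold
        self.data_len = len(data)
        self.reserved_len = self.data_len * extend_fold
        self.storage = list(data) + [None] * (self.reserved_len - self.data_len)
        self.relocations = 0

    def append(self, items):
        items = list(items)
        needed = self.data_len + len(items)
        if needed > self.reserved_len:
            # Capacity exceeded: "move" the block into a larger allocation.
            self.reserved_len = needed * self.extend_fold
            padding = [None] * (self.reserved_len - self.data_len)
            self.storage = self.storage[:self.data_len] + padding
            self.relocations += 1
        # Append in arrival order; nothing is re-sorted.
        self.storage[self.data_len:needed] = items
        self.data_len = needed


block = ReservedBlock([1, 2], extend_fold=3)   # reserves 6 slots
block.append([3, 4, 5])                        # fits, no relocation
block.append([6, 7])                           # overflows, block is expanded
print(block.relocations, block.reserved_len, block.storage[:block.data_len])
```

Most insertions only write into free slots, which is what makes this mode fast to update; the cost is that entries inside a block are no longer in sorted order.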

get_topn_spec_idx_and_similarity(similarity_array, topn=None, min_similarity=0.1)

Get the indices and similarity scores of the top-N most similar items.

This function sorts the similarity array in descending order, selects the top-N indices, and filters out those below the provided minimum similarity threshold.

Parameters:
similarity_array : numpy.ndarray

Array of similarity scores.

topn : int, optional

Number of top similarity scores to return. If None, all entries are considered.

min_similarity : float, optional

Minimum similarity threshold. Scores below this value are excluded. Default is 0.1.

Returns:
tuple[list[int], list[float]]

A tuple containing:

  • result_idx - List of indices corresponding to selected scores.

  • result - List of similarity values for the selected indices.
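The selection logic described above (sort descending, take the top N, drop scores below the threshold) can be sketched with plain lists. The function topn_idx_and_similarity is a hypothetical re-implementation for illustration; keeping scores equal to min_similarity is an assumption based on "scores below this value are excluded".

```python
def topn_idx_and_similarity(similarity, topn=None, min_similarity=0.1):
    """Return (result_idx, result) for the best-scoring entries."""
    # Indices ordered by score, highest first.
    order = sorted(range(len(similarity)), key=lambda i: similarity[i], reverse=True)
    if topn is not None:
        order = order[:topn]
    # Drop entries below the minimum similarity.
    result_idx = [i for i in order if similarity[i] >= min_similarity]
    result = [similarity[i] for i in result_idx]
    return result_idx, result


scores = [0.05, 0.9, 0.3, 0.0, 0.6]
print(topn_idx_and_similarity(scores, topn=3))   # ([1, 4, 2], [0.9, 0.6, 0.3])
print(topn_idx_and_similarity(scores, topn=1))   # ([1], [0.9])
```

Because the threshold is applied after the top-N cut, fewer than topn results can come back when some of the best scores are still below min_similarity.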

read(path_data=None)

Load previously built MS/MS spectral index from disk.

Parameters:
path_data : str or Path, optional

Path to the directory containing the index files. If None, the method uses self.path_data. Default is None.

Returns:
bool

True if the index was successfully read, False if loading failed for any reason.

Notes

  • Any exception during loading returns False and leaves previous state unchanged.

remove_index(path_data=None)

Remove an existing MS/MS spectral index from disk.

This method deletes all dynamic index-related files, including block metadata, binary index blocks, and data files. It is typically used after converting an index to a compact format.

Parameters:
path_data : str or Path, optional

Path to the directory containing the index files to be removed. If None, the method uses self.path_data. Default is None.

Returns:
None

The method removes files from disk and returns None.

search(method='open', precursor_mz=None, peaks=None, ms2_tolerance_in_da=0.02)

Perform open search or neutral-loss search on the MS/MS spectral library.

Parameters:
method : {"open", "neutral_loss"}, optional

Search mode.

  • "open" - identity search or open search

  • "neutral_loss" - neutral-loss-based matching

Default is "open".

precursor_mz : float, optional

Precursor m/z of the query MS/MS spectrum. Required when method="neutral_loss".

peaks : numpy.ndarray

Array of fragment peaks from the query spectrum. The peaks must be preprocessed using clean_spectrum() and normalized such that intensities sum to 1.

ms2_tolerance_in_da : float, optional

Fragment mass tolerance (Da) used for peak matching. Must be less than or equal to max_ms2_tolerance_in_da. Default is 0.02.

Returns:
numpy.ndarray

A 1-D array of entropy similarity scores for all spectra in the library, with dtype float32. Each entry corresponds to the similarity score for one reference spectrum.

search_hybrid(precursor_mz, peaks, ms2_tolerance_in_da=0.02)

Perform hybrid search against the MS/MS spectral index.

Hybrid search incorporates both:

  • Open search (direct fragment-ion matching)

  • Neutral-loss search (matching peaks transformed as precursor_mz - peak_mz)

Parameters:
precursor_mz : float

The precursor m/z of the query MS/MS spectrum.

peaks : numpy.ndarray

Fragment ions of the query spectrum. Must be preprocessed using clean_spectrum(), sorted by m/z, and normalized such that the sum of intensities equals 1.

ms2_tolerance_in_da : float, optional

Mass tolerance (Da) for fragment matching. Must be ≤ max_ms2_tolerance_in_da. Default is 0.02.

Returns:
numpy.ndarray

A 1-D array of entropy similarity scores between the query spectrum and each library spectrum. Dtype is float32 with length equal to the number of indexed spectra.

write(path_data=None)

Write the currently built MS/MS spectral index to disk.

This method serializes all index blocks (product-ion index and optionally the neutral-loss index) along with their associated metadata into the specified directory. The output directory will be created if it does not exist.

Parameters:
path_data : str or Path, optional

Path to the directory where index files will be saved. If None, the method uses self.path_data. Default is None.

Returns:
None

The method writes files to disk and returns None.