Basic usage - Index Construction

To compare spectra efficiently, load the spectral data into a reference library that is stored as an index.

Step 0: Prepare the library spectra

Suppose you have many spectra and want to build a library from them. Format them as follows:

import numpy as np
# Each spectral library is a list of dictionaries, one per MS2 spectrum.

# For each spectrum, 'precursor_mz' and 'peaks' are required.
# 'precursor_mz' should be a float, and 'peaks' should be a 2D np.ndarray such as np.array([[m/z, intensity], [m/z, intensity], [m/z, intensity], ...], dtype=np.float32).


spectra_1_for_library = [{
    "id": "Demo spectrum 1",
    "precursor_mz": 150.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [103.0, 1.0]], dtype=np.float32),
}, {
    "id": "Demo spectrum 2",
    "precursor_mz": 200.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32),
    "metadata": "ABC"
}, {
    "id": "Demo spectrum 3",
    "precursor_mz": 250.0,
    "peaks": np.array([[200.0, 1.0], [101.0, 1.0], [202.0, 1.0]], dtype=np.float32),
    "XXX": "YYY",
}, {
    "precursor_mz": 350.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [302.0, 1.0]], dtype=np.float32),
},
    ]

spectra_2_for_library = ...  # Placeholder: same format as spectra_1_for_library
spectra_3_for_library = ...  # Placeholder: same format as spectra_1_for_library

The keys precursor_mz and peaks are required in this format. All other keys are optional and are treated as metadata of the spectrum, which helps with identification.

Your spectra lists are now ready to be added to the library.
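
Before adding spectra, it can help to check that each dictionary matches the format described above. The helper below is a minimal sketch: check_spectrum_format is a hypothetical name and not part of ms_entropy, and it only checks the two required keys.

```python
import numpy as np


def check_spectrum_format(spectrum):
    """Return a list of problems found in one spectrum dictionary.

    Hypothetical helper for illustration; not part of ms_entropy.
    """
    problems = []
    # Both required keys must be present.
    for key in ("precursor_mz", "peaks"):
        if key not in spectrum:
            problems.append(f"missing required key: {key}")
    if problems:
        return problems
    # 'precursor_mz' should be a float.
    if not isinstance(spectrum["precursor_mz"], (float, np.floating)):
        problems.append("precursor_mz should be a float")
    # 'peaks' should be a 2D array with two columns: m/z and intensity.
    peaks = spectrum["peaks"]
    if not (isinstance(peaks, np.ndarray) and peaks.ndim == 2 and peaks.shape[1] == 2):
        problems.append("peaks should be a 2D np.ndarray with columns [m/z, intensity]")
    return problems


good = {
    "precursor_mz": 150.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0]], dtype=np.float32),
}
bad = {"id": "no peaks", "precursor_mz": 150.0}

print(check_spectrum_format(good))
print(check_spectrum_format(bad))
```

An empty list means that the dictionary has the required keys with the expected types. Optional metadata keys are ignored by this check.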

Step 1: Perform updates

Initial construction

Suppose you first want to construct an index from spectra_1_for_library:

# Firstly, import DynamicEntropySearch.
from ms_entropy import DynamicEntropySearch

# Secondly, assign the path for your library.
entropy_search=DynamicEntropySearch(
        path_data=path_of_your_library,
        max_ms2_tolerance_in_da=0.024, # Maximum MS/MS tolerance (in Daltons) used during spectrum search.
        extend_fold=3, # Expansion factor for preallocated storage in each m/z block. Determines ``reserved_len = data_len * extend_fold``.
        mass_per_block=0.05, # m/z step size for creating the index blocks.
        num_per_group=100_000_000, # Number of spectra assigned to each group.
        cache_list_threshold=1_000_000, # Number of spectra to accumulate in memory before writing them to disk.
        max_indexed_mz=1500.00005, # Maximum m/z value to index. Ions above this threshold are grouped into a single block.
        intensity_weight="entropy",  # "entropy" or None. Determines whether intensities are entropy-weighted.
)

# Thirdly, add the spectra list to the library. By default, the built-in cleaning function is applied.
entropy_search.add_new_spectra(spectra_list=spectra_1_for_library)

# Lastly, call build_index() and write() to end the adding operation.
entropy_search.build_index()
entropy_search.write()

Some tips for this process:

DynamicEntropySearch must be initialized with path_data, which is the path of your library. The rest of the parameters are optional. For a new library, path_data should be a new path, which is created when the class is initialized.
If you only want to build the index for open search, set index_for_neutral_loss to False in both add_new_spectra() and build_index(). After doing so, you can no longer perform neutral loss search or hybrid search on this library, and you cannot add a neutral loss mass index to it later. In other words, once index_for_neutral_loss has been set to False in add_new_spectra() and build_index(), it must remain False from then on; otherwise errors can occur.
add_new_spectra() has a built-in cleaning function for peaks. Cleaning is a required part of construction: if clean is set to False in add_new_spectra(), process the spectra with the external function clean_spectrum() from ms_entropy instead, as in the external cleaning example below. Skipping cleaning can lead to errors.
After all add_new_spectra() calls, always call build_index() and then write() to finish the adding operation and make sure that all spectra are loaded into the index.
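
The open-search-only tip above can be sketched as follows. The wrapper build_open_search_only is a hypothetical helper, not part of ms_entropy; it relies only on what is stated above, namely that add_new_spectra() and build_index() both accept index_for_neutral_loss. A small stand-in class is used so that the sketch runs without a real library, and the default values in that stand-in are assumptions. In practice you would pass a DynamicEntropySearch instance.

```python
def build_open_search_only(entropy_search, spectra_lists):
    """Add spectra and build the index with neutral loss indexing disabled throughout.

    Hypothetical helper: keeps index_for_neutral_loss=False in every call,
    so the flag cannot accidentally differ between calls.
    """
    for spectra_list in spectra_lists:
        entropy_search.add_new_spectra(
            spectra_list=spectra_list, index_for_neutral_loss=False
        )
    entropy_search.build_index(index_for_neutral_loss=False)
    entropy_search.write()


class RecordingStub:
    """Stand-in for DynamicEntropySearch that only records calls (illustration only)."""

    def __init__(self):
        self.calls = []

    def add_new_spectra(self, spectra_list, index_for_neutral_loss=True):
        self.calls.append(("add_new_spectra", index_for_neutral_loss))

    def build_index(self, index_for_neutral_loss=True):
        self.calls.append(("build_index", index_for_neutral_loss))

    def write(self):
        self.calls.append(("write", None))


stub = RecordingStub()
build_open_search_only(stub, [[], []])
print(stub.calls)
```

Routing every update through one function like this is one way to make sure the flag stays False for the lifetime of the library.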

To use the external clean function to process spectra during construction:

# Firstly, import.
from ms_entropy import DynamicEntropySearch
from ms_entropy import clean_spectrum

# Secondly, assign the path for your library.
entropy_search=DynamicEntropySearch(
        path_data=path_of_your_library,
        max_ms2_tolerance_in_da=0.024, # Maximum MS/MS tolerance (in Daltons) used during spectrum search.
        extend_fold=3, # Expansion factor for preallocated storage in each m/z block. Determines ``reserved_len = data_len * extend_fold``.
        mass_per_block=0.05, # m/z step size for creating the index blocks.
        num_per_group=100_000_000, # Number of spectra assigned to each group.
        cache_list_threshold=1_000_000, # Number of spectra to accumulate in memory before writing them to disk.
        max_indexed_mz=1500.00005, # Maximum m/z value to index. Ions above this threshold are grouped into a single block.
        intensity_weight="entropy",  # "entropy" or None. Determines whether intensities are entropy-weighted.
)

# Manually clean using external clean function
precursor_ions_removal_da = 1.6

spectra_1_for_library_clean=[]
for spec in spectra_1_for_library:
    spec['peaks'] = clean_spectrum(
        peaks = spec['peaks'],
        max_mz = spec['precursor_mz'] - precursor_ions_removal_da)
    if len(spec['peaks']) > 0:
        spectra_1_for_library_clean.append(spec)

# Thirdly, add spectra list into the library one by one. Set `clean` to `False` because spectra have been cleaned before.
entropy_search.add_new_spectra(spectra_list=spectra_1_for_library_clean, clean=False)

# Lastly, call build_index() and write() to end the adding operation.
entropy_search.build_index()
entropy_search.write()

In general, we recommend using the internal cleaning in add_new_spectra().

Note that the following three parameters:

max_ms2_tolerance_in_da in the initialization of class DynamicEntropySearch()
min_ms2_difference_in_da in add_new_spectra()
ms2_tolerance_in_da in search functions of DynamicEntropySearch()

should follow this rule: min_ms2_difference_in_da > max_ms2_tolerance_in_da * 2 >= ms2_tolerance_in_da * 2.

An error will be reported if the condition is not met.
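
The rule can be checked before building a library. The function below is a minimal sketch: check_tolerances is a hypothetical helper, not part of ms_entropy, and the numeric values other than 0.024 are illustrative rather than library defaults.

```python
def check_tolerances(min_ms2_difference_in_da, max_ms2_tolerance_in_da, ms2_tolerance_in_da):
    """Return True if the three parameters satisfy
    min_ms2_difference_in_da > max_ms2_tolerance_in_da * 2 >= ms2_tolerance_in_da * 2.
    """
    return (
        min_ms2_difference_in_da > max_ms2_tolerance_in_da * 2
        and max_ms2_tolerance_in_da * 2 >= ms2_tolerance_in_da * 2
    )


# Satisfies the rule: 0.05 > 0.048 >= 0.04
ok = check_tolerances(0.05, 0.024, 0.02)
# Violates the first inequality: 0.04 is not greater than 0.048
too_small_difference = check_tolerances(0.04, 0.024, 0.02)
# Violates the second inequality: the search tolerance exceeds the maximum tolerance
too_large_search_tolerance = check_tolerances(0.05, 0.024, 0.03)

print(ok, too_small_difference, too_large_search_tolerance)
```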

Once these steps are complete, you will find a folder at path_data that serves as the library. It contains several binary files and one or more subfolders. The binary files record information about the subfolders and the metadata. Each subfolder corresponds to a group, the organizational unit directly under a library. The subfolders are named numerically, starting from 0.

Example structure — one library containing 3 groups:

path_of_your_library/
├── 0/
├── 1/
├── 2/
├── group_start.pkl
├── metadata_start_loc.bin
└── metadata.pkl
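
To see which groups a library contains, you can list its numerically named subfolders. The snippet below is a minimal sketch using only the Python standard library; list_group_folders is a hypothetical helper, and the demo runs on a temporary folder with empty placeholder files that mimics the layout above rather than on a real library.

```python
from pathlib import Path
import tempfile


def list_group_folders(library_path):
    """Return the numerically named subfolders of a library folder, sorted by group number."""
    groups = [
        p for p in Path(library_path).iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    return sorted(groups, key=lambda p: int(p.name))


# Demo on a temporary folder that mimics the example structure.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0", "1", "2"):
        (Path(tmp) / name).mkdir()
    for name in ("group_start.pkl", "metadata_start_loc.bin", "metadata.pkl"):
        (Path(tmp) / name).touch()
    group_names = [p.name for p in list_group_folders(tmp)]

print(group_names)  # ['0', '1', '2']
```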

The library containing the index of spectra_1_for_library has now been created. The spectra are stored as an index in group 0 of this library, and you can reopen the library whenever you want.

Subsequent construction

Later, to update the index with spectra_2_for_library and spectra_3_for_library, select the same path and run the update:

# Firstly, import DynamicEntropySearch.
from ms_entropy import DynamicEntropySearch

# Secondly, choose the existing library by its path. This library was built with spectra_1_for_library last time.
entropy_search=DynamicEntropySearch(
        path_data=path_of_your_library,
        max_ms2_tolerance_in_da=0.024, # Maximum MS/MS tolerance (in Daltons) used during spectrum search.
        extend_fold=3, # Expansion factor for preallocated storage in each m/z block. Determines ``reserved_len = data_len * extend_fold``.
        mass_per_block=0.05, # m/z step size for creating the index blocks.
        num_per_group=100_000_000, # Number of spectra assigned to each group.
        cache_list_threshold=1_000_000, # Number of spectra to accumulate in memory before writing them to disk.
        max_indexed_mz=1500.00005, # Maximum m/z value to index. Ions above this threshold are grouped into a single block.
        intensity_weight="entropy",  # "entropy" or None. Determines whether intensities are entropy-weighted.
)

# Thirdly, add spectra list into the library one by one.
entropy_search.add_new_spectra(spectra_list=spectra_2_for_library)
entropy_search.add_new_spectra(spectra_list=spectra_3_for_library)

# Lastly, call build_index() and write() to end the adding operation.
entropy_search.build_index()
entropy_search.write()

The library at path_of_your_library now contains the index of three spectra lists: spectra_1_for_library, spectra_2_for_library, and spectra_3_for_library. They may be distributed across one or more groups, depending on the number of spectra and the value of num_per_group.
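
As a rough illustration of how num_per_group relates to the number of groups, the sketch below computes the minimum number of groups needed to hold a given number of spectra. It assumes that each group holds at most num_per_group spectra and that groups are filled one after another; this filling behaviour is an assumption made for illustration, and estimate_group_count is a hypothetical helper. Check the subfolders of your library for the actual count.

```python
def estimate_group_count(total_spectra, num_per_group=100_000_000):
    """Minimum number of groups needed, assuming groups are filled sequentially
    with at most num_per_group spectra each (an assumption, not confirmed behaviour).
    """
    if total_spectra <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding for large counts.
    return (total_spectra + num_per_group - 1) // num_per_group


small_library = estimate_group_count(4)             # 4 spectra fit in one group
large_library = estimate_group_count(250_000_000)   # 2.5 groups' worth rounds up to 3

print(small_library, large_library)  # 1 3
```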

Tools: external clean function

By default, add_new_spectra() cleans the reference spectra internally before constructing the index.

To separate these two steps, set clean to False in add_new_spectra(), clean the reference spectra beforehand with the external function clean_spectrum(), and then call add_new_spectra() to perform the update. See the external cleaning example above.

Clean spectrum

When clean is set to False, pre-process each reference spectrum with the clean_spectrum() function from ms_entropy before updating the library. This function does the following:

Remove empty peaks (m/z <= 0 or intensity <= 0).
Remove peaks with m/z values greater than precursor_mz - precursor_ions_removal_da (removes precursor ions to improve the quality of spectral comparison).
Centroid the spectrum by merging peaks within +/- min_ms2_difference_in_da and sort the resulting spectrum by m/z.
Remove peaks with intensity less than noise_threshold * maximum intensity.
Retain only the top max_peak_num peaks and remove all others.
Normalize the intensity to sum to 1.
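
The steps above can be illustrated with a simplified plain-Python sketch. This is not the implementation of clean_spectrum(): clean_peaks_sketch is a hypothetical function, it works on lists of (m/z, intensity) tuples rather than NumPy arrays, it skips the centroiding step entirely (peaks are only sorted by m/z), and the parameter values shown are illustrative rather than library defaults.

```python
def clean_peaks_sketch(peaks, max_mz, noise_threshold=0.01, max_peak_num=None):
    """Simplified illustration of the cleaning steps (centroiding omitted)."""
    # Remove empty peaks (m/z <= 0 or intensity <= 0).
    kept = [(mz, inten) for mz, inten in peaks if mz > 0 and inten > 0]
    # Remove peaks with m/z greater than max_mz
    # (max_mz = precursor_mz - precursor_ions_removal_da).
    kept = [(mz, inten) for mz, inten in kept if mz <= max_mz]
    # Centroiding is NOT done here; the peaks are only sorted by m/z.
    kept.sort(key=lambda peak: peak[0])
    if not kept:
        return []
    # Remove peaks with intensity below noise_threshold * maximum intensity.
    max_intensity = max(inten for _, inten in kept)
    kept = [(mz, inten) for mz, inten in kept if inten >= noise_threshold * max_intensity]
    # Retain only the top max_peak_num peaks by intensity, then restore m/z order.
    if max_peak_num is not None and len(kept) > max_peak_num:
        top = sorted(kept, key=lambda peak: peak[1], reverse=True)[:max_peak_num]
        kept = sorted(top, key=lambda peak: peak[0])
    # Normalize the intensities so that they sum to 1.
    total = sum(inten for _, inten in kept)
    return [(mz, inten / total) for mz, inten in kept]


precursor_mz = 150.0
precursor_ions_removal_da = 1.6
raw_peaks = [
    (100.0, 50.0),
    (0.0, 10.0),    # empty peak: m/z <= 0
    (120.0, 0.0),   # empty peak: intensity <= 0
    (149.5, 30.0),  # above precursor_mz - precursor_ions_removal_da
    (110.0, 0.2),   # below the noise threshold (0.01 * 50.0 = 0.5)
    (105.0, 50.0),
]

cleaned = clean_peaks_sketch(raw_peaks, max_mz=precursor_mz - precursor_ions_removal_da)
print(cleaned)  # [(100.0, 0.5), (105.0, 0.5)]
```

For real libraries, use clean_spectrum() or the internal cleaning in add_new_spectra(), which also perform the centroiding step.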