4.9. Mass Functions

Functions related to mass operations, including mapping and clustering.

Functions here could potentially be sped up by JIT compilation, but the code is currently weakly typed. To use Numba for JIT, they need to be rewritten with clear typing and likely compartmentalized. Alternatively, some of the mass functions can be implemented in C and compiled to interface with Python.

asari.mass_functions.all_mass_paired_mapping(list1, list2, std_ppm=5)

To find all matched pairs of m/z values between two lists. When multiple matches occur within std_ppm, this keeps all pairs.

Singletons are included in returned result if no matches are found. This does not calculate ratio_deltas.

Parameters:

list1 (list[float]) – list of m/z values, not necessarily the same length as list2
list2 (list[float]) – list of m/z values, not necessarily the same length as list1
std_ppm (float, optional, default: 5) – limit of instrument accuracy in matching m/z values.

Returns:

mapped – list of mapped index pairs. E.g. [ (3, 6), (6, 8), (33, 151), …]
list1_unmapped – list of unmapped items in list1.
list2_unmapped – list of unmapped items in list2.

Note

This function returns all matched pairs within std_ppm using a tree search approach (mass2chem.search.build_centurion_tree_mzlist). The other functions, mass_paired_mapping and complete_mass_paired_mapping, return only one pair per match and use a sorting based algorithm.
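The "keep all pairs" behavior can be illustrated with a brute-force sketch. This is not asari's implementation, which uses a tree search. The name all_pairs_within_ppm, the use of the list1 value as the ppm reference, and the choice to report unmapped items as indices are all assumptions made for illustration.

```python
def all_pairs_within_ppm(list1, list2, std_ppm=5):
    """Brute-force illustration: keep every (i, j) index pair within std_ppm."""
    mapped = []
    for i, mz1 in enumerate(list1):
        for j, mz2 in enumerate(list2):
            # ppm distance, using the list1 value as reference (an assumption)
            if abs(mz1 - mz2) / mz1 * 1e6 < std_ppm:
                mapped.append((i, j))
    matched1 = {i for i, _ in mapped}
    matched2 = {j for _, j in mapped}
    # unmapped items are reported as indices here (an assumption)
    list1_unmapped = [i for i in range(len(list1)) if i not in matched1]
    list2_unmapped = [j for j in range(len(list2)) if j not in matched2]
    return mapped, list1_unmapped, list2_unmapped


mapped, un1, un2 = all_pairs_within_ppm(
    [100.0000, 200.0000], [100.0002, 100.0004, 300.0000]
)
```

In this toy input the first value of list1 is paired with two values of list2, and both pairs are kept.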

asari.mass_functions.bin_by_median(List_of_tuples, func_tolerance)

Separate List_of_tuples into bins within tolerance, by the distance of edge to median.

Parameters:

List_of_tuples (list[tuple]) – [(value, object), (value, object), …], to be separated into bins by values (either rt or mz). objects have attribute of sample_name if to align elsewhere.
func_tolerance (function) – tolerance function to define the boundary that separates bins.

Returns:

A list of separated bins,
each as a list of objects as [X[1] for X in L]. It is possible that all objects fall in the same bin.

Note

Not perfect because left side may deviate out of tolerance, even though LC-MS data always have enough gaps for separation. Use with caution.
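One plausible reading of "distance of edge to median" is sketched below: a new bin is started whenever the next value is further above the running median of the current bin than the tolerance allows. The sketch assumes the input is sorted by value; the names bin_by_median_sketch and ppm5 are hypothetical, and asari's own code may differ.

```python
from statistics import median


def bin_by_median_sketch(list_of_tuples, func_tolerance):
    """Illustrative only: split sorted (value, object) tuples by distance to the bin median."""
    bins = [[list_of_tuples[0]]]
    for item in list_of_tuples[1:]:
        # the median moves as the current bin grows
        current_median = median(v for v, _ in bins[-1])
        if item[0] - current_median < func_tolerance(item[0]):
            bins[-1].append(item)
        else:
            bins.append([item])
    # keep only the objects, dropping the values used for binning
    return [[obj for _, obj in b] for b in bins]


def ppm5(mz):
    """Hypothetical tolerance function: 5 ppm of the given m/z."""
    return mz * 5e-6


data = [(100.0000, "a"), (100.0002, "b"), (100.0100, "c")]
bins = bin_by_median_sketch(data, ppm5)
```

Because the median shifts upward as items are appended, the leftmost member of a bin can end up further from the final median than the tolerance, which is the caveat raised in the note.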

asari.mass_functions.calculate_selectivity(sorted_mz_list, std_ppm=5)

To calculate m- or d-selectivity for a list of m/z values, which can be all features in an experiment or a database. The mass selectivity between two m/z values is defined as (1 - Probability(confusing two peaks)), with the probability formalized as an exponential model: P = exp( -x/std_ppm ), where x is the ppm distance between two peaks and std_ppm is the standard deviation of ppm between true peaks and theoretical values, default 5 ppm.

The selectivity value lies in (0, 1); values close to 1 mean high selectivity. If multiple adjacent peaks are present, their selectivity scores are multiplied. Considering 2 lower and 2 higher neighbors is a good approximation here.

Parameters:

sorted_mz_list (list) – a list of m/z values, sorted from low to high, length > 3.
std_ppm (float, optional, default: 5) – mass resolution in ppm (parts per million).

Return type:

A list of selectivity values, in matched order as the input m/z list.

Note

ppm is actually dependent on m/z, so this is not an ideal method, but it is common practice and a good enough approximation.
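The model described above can be sketched as follows. The pairwise formula 1 - exp(-x/std_ppm) and the use of up to 2 lower and 2 higher neighbors come from the description; the function names, the use of the current m/z as the ppm reference, and the handling of the list ends are assumptions, and asari's own code may treat edge cases differently.

```python
import math


def pair_selectivity(ppm_distance, std_ppm=5):
    """Selectivity between two peaks: 1 - exp(-x / std_ppm)."""
    return 1 - math.exp(-ppm_distance / std_ppm)


def selectivity_sketch(sorted_mz_list, std_ppm=5):
    """Multiply pairwise selectivities over up to 2 lower and 2 higher neighbors."""
    n = len(sorted_mz_list)
    result = []
    for i, mz in enumerate(sorted_mz_list):
        score = 1.0
        for k in range(max(0, i - 2), min(n, i + 3)):
            if k != i:
                # ppm distance relative to the current m/z (an assumption)
                ppm = abs(sorted_mz_list[k] - mz) / mz * 1e6
                score *= pair_selectivity(ppm, std_ppm)
        result.append(score)
    return result


well_separated = selectivity_sketch([100.0, 200.0, 300.0, 400.0])
crowded = selectivity_sketch([100.0, 100.0005, 200.0, 300.0])
```

Two peaks 5 ppm apart get a pairwise selectivity of 1 - exp(-1), roughly 0.63, while well-separated peaks score close to 1.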

asari.mass_functions.check_close_mzs(mzlist, ppm_tol=5)

Given a list of mz_values and a mz tolerance in ppm, determine if any pair of values differ by less than the mz tolerance. mzlist must be sorted in ascending order!

Parameters:

mzlist (list[float]) – list of floating point values representing m/z values, in ascending order
ppm_tol (float, optional, default: 5) – m/z tolerance in ppm

Return type:

A list of the index pairs representing m/z values within the m/z tolerance
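A minimal sketch of such a check on a sorted list is shown below. The name close_mz_pairs is hypothetical, and whether asari compares only adjacent values or every pair within tolerance is not stated here; this sketch reports every pair.

```python
def close_mz_pairs(mzlist, ppm_tol=5):
    """Return index pairs (i, j), i < j, within ppm_tol; mzlist must be ascending."""
    pairs = []
    for i, mz in enumerate(mzlist):
        limit = mz * (1 + ppm_tol * 1e-6)
        j = i + 1
        # sorted input lets the scan stop as soon as a value exceeds the limit
        while j < len(mzlist) and mzlist[j] < limit:
            pairs.append((i, j))
            j += 1
    return pairs


pairs = close_mz_pairs([100.0000, 100.0003, 100.0100, 250.0000, 250.0010])
```

The early exit in the inner loop is only valid because the input is sorted in ascending order.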

asari.mass_functions.complete_mass_paired_mapping(list1, list2, std_ppm=5)

To find best matched pairs of m/z values between two lists. When multiple matches occur within std_ppm, this keeps the pair of closest m/z.

This is different from mass_paired_mapping, which only keeps unique matches. Singletons are included in returned result if no matches are found. This does not calculate ratio_deltas.

Parameters:

list1 (list[float]) – list of m/z values, not necessarily the same length as list2
list2 (list[float]) – list of m/z values, not necessarily the same length as list1
std_ppm (float, optional, default: 5) – limit of instrument accuracy in matching m/z values.

Returns:

mapped – list of mapped index pairs. E.g. [ (3, 6), (6, 8), (33, 151), …]
list1_unmapped – list of unmapped items in list1.
list2_unmapped – list of unmapped items in list2.

Note

This and related functions are for m/z alignment only, not used for general search. See asari.tools.match_features for general search.
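The "keep the pair of closest m/z" rule can be illustrated with a greedy sketch: collect all candidate pairs within std_ppm, then accept them from closest to furthest while each index is still unused. This is an illustration under stated assumptions, not asari's sorting-based algorithm; the name closest_pairs_sketch and the index-based unmapped lists are hypothetical.

```python
def closest_pairs_sketch(list1, list2, std_ppm=5):
    """Greedy illustration: among candidates within std_ppm, keep the closest pairs first."""
    candidates = []
    for i, mz1 in enumerate(list1):
        for j, mz2 in enumerate(list2):
            ppm = abs(mz1 - mz2) / mz1 * 1e6
            if ppm < std_ppm:
                candidates.append((ppm, i, j))
    mapped, used1, used2 = [], set(), set()
    for _, i, j in sorted(candidates):
        # each index may appear in at most one pair
        if i not in used1 and j not in used2:
            mapped.append((i, j))
            used1.add(i)
            used2.add(j)
    list1_unmapped = [i for i in range(len(list1)) if i not in used1]
    list2_unmapped = [j for j in range(len(list2)) if j not in used2]
    return sorted(mapped), list1_unmapped, list2_unmapped


mapped, un1, un2 = closest_pairs_sketch(
    [100.0000, 200.0000], [100.0004, 100.0001, 200.0002]
)
```

Here 100.0000 has two candidates in list2 within 5 ppm, and only the closer one (100.0001) is kept; the other is reported as unmapped.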

asari.mass_functions.flatten_tuplelist(L)

Given a list of tuples, e.g. [(a,b), …], flatten the list to [a, b, …], keeping unique entries only.

Parameters:

L (list[tuple]) – list of tuples [(a,b), …]

Returns:

list of the unique elements in L, flattened to one dimension, as [a, b, …]
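A minimal sketch of this flattening follows. The name flatten_unique is hypothetical, and the ordering of the result is an assumption: this version keeps first-seen order, which the description does not promise, and it requires hashable elements.

```python
def flatten_unique(tuple_list):
    """Flatten [(a, b), ...] to a one-dimensional list of unique entries."""
    seen = set()
    flat = []
    for tup in tuple_list:
        for item in tup:
            if item not in seen:
                seen.add(item)
                flat.append(item)
    return flat


flat = flatten_unique([(1, 2), (2, 3), (5, 1)])
```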

asari.mass_functions.gap_divide_mz_cluster(bin_data_tuples, mz_tolerance)

Divides bin_data_tuples by the largest gap in m/z values. mz_tolerance is not used now, assuming rarely needed after prior steps. This is a fallback method when identify_mass_peaks fails. See nn_cluster_by_mz_seeds.

Parameters:

bin_data_tuples (list[tuple]) – a flexible bin in format of [(mz, scan_num, intensity_int), …], or [(m/z, track_id, sample_id), …].
mz_tolerance – the allowed tolerance in m/z values, NOT USED.

Return type:

Two lists after dividing bin_data_tuples by the largest gap.
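Dividing at the largest gap can be sketched as below. The name divide_by_largest_gap is hypothetical; the sketch sorts by m/z first and needs at least two tuples, and it omits the unused mz_tolerance argument.

```python
def divide_by_largest_gap(bin_data_tuples):
    """Split tuples (m/z as first element) into two lists at the largest m/z gap."""
    ordered = sorted(bin_data_tuples, key=lambda t: t[0])
    gaps = [ordered[k + 1][0] - ordered[k][0] for k in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1   # first index of the upper group
    return ordered[:cut], ordered[cut:]


low, high = divide_by_largest_gap(
    [(100.0001, 7, 500), (100.0010, 8, 300), (100.0002, 9, 800)]
)
```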

asari.mass_functions.identify_mass_peaks(bin_data_tuples, mz_tolerance, presorted=True)

Get the centroid m/z values as peaks in m/z values distribution, at least mz_tolerance apart. This can be used to choose consensus m/z values in constructing or after aligning mass tracks. See nn_cluster_by_mz_seeds.

Parameters:

bin_data_tuples (list[tuple]) – a flexible bin in format of [(mz, scan_num, intensity_int), …], or [(m/z, track_id, sample_id), …].
mz_tolerance (float) – precomputed based on m/z and ppm, e.g. 5 ppm of 80 = 0.0004; 5 ppm of 800 = 0.0040.
presorted (boolean, optional, default: True) – flag to determine if sorting is needed on bin_data_tuples.

Return type:

A list of m/z values, peaks in m/z values distribution, at least mz_tolerance apart.
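The precomputed mz_tolerance is a ppm value converted to an absolute m/z difference at a given m/z. The helper below reproduces the two figures quoted for the parameter; the name ppm_to_mz_tolerance is hypothetical and not part of asari.

```python
def ppm_to_mz_tolerance(mz, ppm=5):
    """Convert a ppm tolerance into an absolute m/z tolerance at a given m/z."""
    return mz * ppm * 1e-6


tol_80 = ppm_to_mz_tolerance(80)     # about 0.0004
tol_800 = ppm_to_mz_tolerance(800)   # about 0.0040
```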

asari.mass_functions.landmark_guided_mapping(REF_reference_mzlist, REF_landmarks, SM_mzlist, SM_landmarks, std_ppm=5, correction_tolerance_ppm=1)

This is the main m/z alignment function in asari. A new sample is compared to the reference in a MassGrid, then added to the MassGrid based on matched m/z values. The MassGrid records the alignment information.

This is a two-step process: aligning the anchors (landmarks) first, mz correction if needed, then completing the remaining m/z values. Since the landmarks are of high confidence, this improves the quality of m/z alignment.

Parameters:

REF_reference_mzlist (list) – the list of mz values from the REF sample
REF_landmarks (list) – the list of landmarks in the REF sample
SM_mzlist (list) – the list of mz values in the sample
SM_landmarks (list) – the list of landmarks in the sample
std_ppm (float, optional, default: 5) – the assumed mass resolution in ppm
correction_tolerance_ppm (float, optional, default: 1) – a mass correction is applied if the mass shift is above this value

Returns:

new_reference_mzlist – combined list of all unique m/z values, maintaining original order of REF_reference_mzlist but updating the values as mean of the two lists.
new_reference_map2 – mapping index numbers from SM_mzlist, to be used to update MassGrid[Sample.input_file]
REF_landmarks – updated landmark m/z values using the new index numbers as part of new_reference_mzlist
_r – correction ratios on SM_mzlist, to be attached to Sample class instance.

Note

The m/z values are updated here because this is the best place to do it: SM_mzlist is already corrected if needed; no need to look up irregular values in MassGrid. This mixes features from samples and they need to be consistent on how they are calibrated.

The mzlists are already in ascending order when a Sample is processed, but the order of REF_reference_mzlist will be disrupted while building the MassGrid. Correction is applied to list2 if the m/z shift exceeds correction_tolerance_ppm. See MassGrid.add_sample.

asari.mass_functions.mass_paired_mapping(list1, list2, std_ppm=5)

To find unambiguous matches of m/z values between two lists. Nonunique matches are left out.

This sorts all m/z values first, then compares their differences in sequential neighbors. To be considered an unambiguous match, the m/z values from two lists should have no overlapping neighbors in either direction in either list other than their own pair. Thus, low-selectivity values are not considered in matching. This and related functions are for m/z alignment only, not used for general search. See asari.tools.match_features for general search.

This shares some similarity to the RANSAC algorithm but prioritizes selectivity. For illustration, one can use one-step Gaussian model for mass shift. Since only mean shift is used here, and stdev is implicitly enforced in matching, no need to do model fitting.

Parameters:

list1 (list[float]) – list of m/z values, not necessarily the same length as list2
list2 (list[float]) – list of m/z values, not necessarily the same length as list1
std_ppm (float, optional, default: 5) – limit of instrument accuracy in matching m/z values.

Returns:

mapped – mapping list [(index from list1, index from list2), …]
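ratio_deltas – m/z ratio shifts between the two lists, one per mapped pair as in the example below; their mean gives the overall shift. Values are ratios (ppm*10^-6), so no conversion to or from ppm is needed here.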

Examples

>>> list1 = [101.0596, 101.061, 101.0708, 101.0708, 101.1072, 101.1072, 101.1072, 102.0337,
    102.0337, 102.0548, 102.0661, 102.0912, 102.0912, 102.1276, 102.1276, 103.0501,
    103.0501, 103.0541, 103.0865, 103.0865, 103.9554, 104.0368, 104.0705, 104.0705,
    104.1069, 104.1069, 104.9922, 105.0422, 105.0698, 105.0698, 105.0738, 105.1039,
    105.1102, 105.9955, 106.0497, 106.065, 106.065, 106.0683, 106.0683, 106.0861, 106.0861,
    106.0861, 106.1111, 106.9964, 107.0475, 107.0602, 107.0653, 107.0895, 107.9667, 108.0443,
    108.0555, 108.0807, 109.0632, 109.0759]
>>> list2 = [101.0087, 101.035, 101.0601, 101.0601, 101.0601, 101.0601, 101.0713, 101.0714,
    101.1077, 101.1077, 101.1077, 101.1077, 101.1077, 101.1158, 101.1158, 102.0286, 102.0376,
    102.0468, 102.0539, 102.0554, 102.0554, 102.0554, 102.0554, 102.0666, 102.0917, 102.0917,
    102.0917, 102.0918, 102.1281, 102.1281, 102.1282, 103.0394, 103.0505, 103.0507, 103.0547,
    103.1233, 103.8162, 103.956, 103.956, 103.956, 104.0532, 104.0533, 104.0641, 104.0709,
    104.071, 104.0831, 104.0878, 104.0895, 104.0953, 104.1073, 104.1073, 104.1074, 104.1074,
    104.1182, 104.1199, 104.1265, 104.1318, 104.1354, 104.1725, 104.3998, 104.9927, 104.9927,
    104.9927, 104.9927, 105.0654, 105.0703, 105.1043, 105.1133, 106.049, 106.0503, 106.0655,
    106.0688, 106.0866, 106.0867, 106.0867, 106.0867, 106.114, 107.048, 107.0481, 107.0496,
    107.0608, 107.0658, 108.0109, 108.0482, 108.0604, 108.0812, 108.0812, 108.9618, 109.0507,
    109.0637, 109.0637, 109.0764, 109.1015]
>>> mass_paired_mapping(list1, list2)
([(10, 23), (29, 65), (31, 66), (36, 70), (38, 71), (46, 81), (53, 91)],
 [4.898762180656323e-06,
  4.758718686464085e-06,
  3.805743437700149e-06,
  4.714068193732999e-06,
  4.713921530199148e-06,
  4.670025348919892e-06,
  4.583942997773922e-06])
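The first ratio_deltas entry in the example can be reproduced by hand from the first mapped pair (10, 23). Dividing by the list2 value is a convention inferred from the example numbers, not taken from the source code.

```python
mz_list1 = 102.0661   # list1[10] in the example
mz_list2 = 102.0666   # list2[23] in the example

# dimensionless ratio shift; the list2 value as denominator is inferred from the example
ratio_delta = (mz_list2 - mz_list1) / mz_list2
# the same shift expressed in ppm
ppm_shift = ratio_delta * 1e6
```

The shift of about 4.9 ppm is just under the default std_ppm of 5, which is why this pair qualifies as a match.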

asari.mass_functions.mass_paired_mapping_with_correction(list1, list2, std_ppm=5, correction_tolerance_ppm=1)

To find unambiguous matches of m/z values between two lists, with correction on list2 if the m/z shift exceeds correction_tolerance_ppm. See mass_paired_mapping for details.

Parameters:

list1 (list[float]) – list of m/z values, not necessarily the same length as list2
list2 (list[float]) – list of m/z values, not necessarily the same length as list1
std_ppm (float, optional, default: 5) – limit of instrument accuracy in matching m/z values.
correction_tolerance_ppm (float, optional, default: 1) – threshold to trigger m/z recalibration of list2.

Returns:

mapped – mapping list [(index from list1, index from list2), …]
_r – correction ratios on list2
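The correction step can be sketched as follows, assuming the ratio shifts of the matched pairs are already available and are defined relative to the list2 values. The name correct_mz_shift, the use of the absolute value in the threshold test, and the exact rescaling formula are assumptions; asari's own code may differ.

```python
def correct_mz_shift(list2, ratio_deltas, correction_tolerance_ppm=1):
    """Rescale list2 when the mean m/z ratio shift exceeds the tolerance (illustrative)."""
    mean_ratio = sum(ratio_deltas) / len(ratio_deltas)
    if abs(mean_ratio) > correction_tolerance_ppm * 1e-6:
        # remove the systematic shift so that list2 lines up with list1
        list2 = [mz * (1 - mean_ratio) for mz in list2]
    return list2, mean_ratio


shifted = [100.0005, 200.0010]                        # both about 5 ppm too high
corrected, r = correct_mz_shift(shifted, [5e-6, 5e-6])
unchanged, r_small = correct_mz_shift(shifted, [5e-7])
```

After a correction like this, the matching would be repeated on the corrected list2 so that the final mapping reflects the recalibrated values.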

asari.mass_functions.nn_cluster_by_mz_seeds(bin_data_tuples, mz_tolerance, presorted=True)

Complete NN clustering, by assigning each data tuple to its closest m/z seed. This is used for both m/z alignment and mass track construction. See chromatograms.build_chromatogram_by_mz_clustering and constructors.MassGrid.bin_track_mzs.

Parameters:

bin_data_tuples (list[tuple]) – a flexible bin in format of [(mz, scan_num, intensity_int), …], or [(m/z, track_id, sample_id), …].
mz_tolerance (float) – precomputed based on m/z and ppm, e.g. 5 ppm of 80 = 0.0004; 5 ppm of 800 = 0.0040.
presorted (boolean, optional, default: True) – flag to determine if sorting is needed on bin_data_tuples.

Return type:

A list of clusters as separated bins, each bin as [(mz, scan_num, intensity_int), …]

Note

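Do not initialize the clusters with clusters = [[]] * _NN. List multiplication creates _NN references to the same inner list, so an entry appended to one cluster appears in all of them. This is standard Python behavior rather than a compiler bug, and it caused the duplication of entries noted on 2022-05-21.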

Future consideration: np.argmin will be faster than sorted here.
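The list initialization pitfall and the nearest-seed assignment can be shown in a few lines. The function assign_to_nearest_seed and the example seeds are hypothetical; the sketch takes the seeds as given rather than deriving them with identify_mass_peaks, and it uses plain Python instead of np.argmin.

```python
n_seeds = 3

aliased = [[]] * n_seeds                     # three references to ONE list
aliased[0].append("x")                       # shows up in every "cluster"

independent = [[] for _ in range(n_seeds)]   # three separate lists
independent[0].append("x")


def assign_to_nearest_seed(bin_data_tuples, seeds):
    """Assign each tuple (m/z as first element) to the cluster of its closest seed m/z."""
    clusters = [[] for _ in seeds]
    for item in bin_data_tuples:
        nearest = min(range(len(seeds)), key=lambda k: abs(item[0] - seeds[k]))
        clusters[nearest].append(item)
    return clusters


clusters = assign_to_nearest_seed(
    [(100.0001, 1, 50), (100.0009, 2, 60), (100.0002, 3, 70)],
    seeds=[100.0000, 100.0010],
)
```

Building the clusters with a list comprehension gives each seed its own list, which avoids the duplicated entries.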