4.4. Chromatograms
Functions related to chromatograms and mass tracks. RT scan numbers and intensities are handled as integers. Binning is flexible and based on ppm accuracy. An m/z tolerance (default 5 ppm) is used in XIC construction. XICs without neighbors within x ppm are considered specific (i.e. high selectivity). Low-selectivity regions are still inspected to determine the true number of XICs.
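As a small illustration of the ppm-based tolerance mentioned above, the sketch below converts a ppm tolerance into an absolute m/z window. The helper names ppm_to_mz_tolerance and within_ppm are hypothetical and not part of asari, and measuring the tolerance at the mean of the two m/z values is an assumption.

```python
def ppm_to_mz_tolerance(mz, ppm=5):
    """Absolute m/z tolerance that corresponds to `ppm` at a given m/z."""
    return mz * ppm * 1e-6


def within_ppm(mz1, mz2, ppm=5):
    """True if two m/z values differ by no more than `ppm` of their mean."""
    return abs(mz1 - mz2) <= ppm_to_mz_tolerance((mz1 + mz2) / 2, ppm)
```

With the default of 5 ppm, the window is about 0.0004 at m/z 80 and about 0.0040 at m/z 800, so the absolute window widens as m/z increases.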
- asari.chromatograms.bin_to_mass_tracks(bin_data_tuples, rt_length, mz_tolerance_ppm=5)[source]
Construct mass tracks from data that may exceed proper m/z range (i.e. over mz_tolerance_ppm). A mass track is an EIC for full RT range, without separating the mass traces. Imperfect ROIs will be examined in extract_massTracks_ and merged if needed.
- Parameters:
bin_data_tuples (list[tuple]) – a flexible bin by units of 0.001 amu, in format of [(mz, scan_num, intensity_int), …]. This may or may not be within mz_tolerance_ppm.
rt_length (int) – full number of scans.
mz_tolerance_ppm (float, optional, default: 5) – m/z tolerance in parts per million.
- Return type:
a list of massTracks as [(mz, np.array(intensities at full rt range)), …].
- asari.chromatograms.build_chromatogram_by_mz_clustering(bin_data_tuples, rt_length, mz_tolerance)[source]
Generates a list of prototype extracted ion chromatograms to be used by extract_single_track_fullrt_length.
- Parameters:
bin_data_tuples (list[tuple]) – a flexible bin in format of [(mz, scan_num, intensity_int), …].
rt_length (int) – number of scans (NOT USED, remove from call signature in future)
mz_tolerance (float) – precomputed based on m/z and ppm, e.g. 5 ppm of 80 = 0.0004; 5 ppm of 800 = 0.0040.
- Return type:
A list of clusters as separated bins, each bin as [(mz, scan_num, intensity_int), …]
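To make the idea of separating a flexible bin into clusters concrete, here is a minimal sketch that splits data points wherever the gap between adjacent sorted m/z values exceeds mz_tolerance. This gap-based rule is an assumption chosen for illustration and is not necessarily how build_chromatogram_by_mz_clustering works internally; the name cluster_by_mz_gap is hypothetical.

```python
def cluster_by_mz_gap(bin_data_tuples, mz_tolerance):
    """Split (mz, scan_num, intensity_int) points into clusters.

    A new cluster starts whenever the m/z gap to the previous
    point (in sorted order) is larger than mz_tolerance.
    """
    if not bin_data_tuples:
        return []
    points = sorted(bin_data_tuples)  # tuples sort by m/z first
    clusters = [[points[0]]]
    for point in points[1:]:
        previous_mz = clusters[-1][-1][0]
        if point[0] - previous_mz > mz_tolerance:
            clusters.append([point])
        else:
            clusters[-1].append(point)
    return clusters
```

Each returned cluster has the same format as the input bin, so it could be passed on to a track builder that expects data points already in a limited m/z range.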
- asari.chromatograms.build_chromatogram_intensity_aware(bin_data_tuples, rt_length, mz_tolerance)[source]
Chromatogram builder as in ADAP, for testing. Start with the highest intensity, go down by intensity and include data points within mz_tolerance. Repeat until no track is detected. Continuous RT is not required here; it is handled in extract_single_track_fullrt_length. The default in asari is build_chromatogram_by_mz_clustering.
- Parameters:
bin_data_tuples (list[tuple]) – a flexible bin in format of [(mz, scan_num, intensity_int), …].
rt_length (int) – number of scans (NOT USED, remove from call signature in future)
mz_tolerance_ppm (float) – m/z tolerance in parts per million. Used to segregate m/z regions here.
- Returns:
assigned – separated bins of [(mz, scan_num, intensity_int), …], prototype of extracted ion chromatograms to be used by extract_single_track_fullrt_length.
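The intensity-first strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions (the seed is the most intense remaining point, and membership is decided by distance to the seed m/z); it is not the asari implementation, and the name build_by_intensity is hypothetical.

```python
def build_by_intensity(bin_data_tuples, mz_tolerance):
    """Group (mz, scan_num, intensity_int) points around intensity seeds.

    Repeatedly take the most intense remaining point as a seed and
    collect every remaining point within mz_tolerance of its m/z.
    """
    remaining = sorted(bin_data_tuples, key=lambda p: p[2], reverse=True)
    tracks = []
    while remaining:
        seed_mz = remaining[0][0]
        track = [p for p in remaining if abs(p[0] - seed_mz) <= mz_tolerance]
        remaining = [p for p in remaining if abs(p[0] - seed_mz) > mz_tolerance]
        tracks.append(track)
    return tracks
```

The seed always falls within tolerance of itself, so every pass removes at least one point and the loop terminates.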
- asari.chromatograms.clean_rt_calibration_points(rt_cal_pairs)[source]
Remove redundant RT calibration data points and outliers (beyond 3x standard deviation). This does not force an even distribution of calibration data points.
- Parameters:
rt_cal_pairs (list) – paired scan numbers from this sample and from the reference sample.
- Returns:
rt_cal_pairs – clean and sorted version of rt_cal_pairs.
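A minimal sketch of this kind of cleaning is shown below. It assumes that a redundant point is one that repeats a sample scan number (the first one seen is kept) and that outliers are judged on the shift between reference and sample scan numbers; both choices are assumptions, and the name clean_pairs is hypothetical.

```python
from statistics import mean, pstdev


def clean_pairs(rt_cal_pairs):
    """Deduplicate, sort, and drop pairs whose shift is beyond 3x stdev."""
    unique = {}
    for sample_scan, ref_scan in rt_cal_pairs:
        # Assumption: keep the first reference value per sample scan.
        unique.setdefault(sample_scan, ref_scan)
    pairs = sorted(unique.items())
    if len(pairs) < 2:
        return pairs
    shifts = [ref - sample for sample, ref in pairs]
    center, spread = mean(shifts), pstdev(shifts)
    return [
        pair for pair, shift in zip(pairs, shifts)
        if abs(shift - center) <= 3 * spread
    ]
```

Note that a 3x standard deviation rule can only flag outliers when there are enough points; with a handful of pairs, no single point can lie that far from the mean.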
- asari.chromatograms.dwt_rt_calibrate(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers)[source]
Placeholder. Not implemented.
- asari.chromatograms.extract_massTracks_(ms_expt, mz_tolerance_ppm=5, min_intensity=100, min_timepoints=5, min_peak_height=1000)[source]
Extract mass tracks from an object of parsed LC-MS data file. A mass track is an EIC for full RT range, without separating the mass traces of same m/z.
- Parameters:
ms_expt (pymzml.run.Reader(f)) – instance of pymzml.run.Reader(f), a parsed object of LC-MS data file
mz_tolerance_ppm (float, optional, default: 5) – m/z tolerance in parts per million. Used to segregate m/z regions here.
min_intensity (float, optional, default: 100) – minimal intensity value, needed because some instruments keep 0s
min_timepoints (int, optional, default: 5) – minimal consecutive scans to be considered real signal.
min_peak_height (float, optional, default: 1000) – a bin is not considered if the max intensity < min_peak_height.
- Returns:
Dict with keys ‘rt_numbers’, ‘rt_times’, and ‘tracks’, each with a value that is a list of co-indexed rt_numbers, rt_times, and mass tracks respectively. Mass tracks are represented as [(mz, np.array(intensities at full rt range)), …]
- asari.chromatograms.extract_single_track_fullrt_length(bin, rt_length, INTENSITY_DATA_TYPE=<class 'numpy.int64'>)[source]
Build a mass track from a bin of data points already in limited m/z range. A mass track is an EIC for full RT range, without separating the mass traces. Consensus m/z is taken as the mean of median m/z and the m/z of highest intensity, to be more robust. When multiple data points exist in the same scan (same RT), max intensity is used.
- Parameters:
bin (list[tuple]) – data points, in format of [(mz_int, scan_num, intensity_int), …].
rt_length (int) – full number of scans.
INTENSITY_DATA_TYPE (any int numpy.dtype, optional, default: np.int64) – data type of the intensity array. np.int64 is used to be future safe, but int32 may be adequate and more efficient.
- Return type:
a massTrack as ( mz, np.array(intensities at full rt range) ).
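The consensus m/z rule and the per-scan maximum described above can be sketched in plain Python. This is an illustration only: it returns a list rather than a numpy array, assumes scan numbers are zero-based indices below rt_length, and uses the hypothetical name single_track.

```python
from statistics import median


def single_track(bin_points, rt_length):
    """Build (consensus_mz, intensities) from (mz, scan_num, intensity) points."""
    intensities = [0] * rt_length
    for _mz, scan_num, intensity in bin_points:
        # Several points in the same scan: keep the maximum intensity.
        intensities[scan_num] = max(intensities[scan_num], intensity)
    mz_values = [point[0] for point in bin_points]
    mz_at_max = max(bin_points, key=lambda point: point[2])[0]
    # Consensus m/z: mean of the median m/z and the m/z at highest intensity.
    consensus_mz = 0.5 * (median(mz_values) + mz_at_max)
    return consensus_mz, intensities
```

Scans without any data point keep an intensity of 0, which is what makes the track cover the full RT range.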
- asari.chromatograms.get_thousandth_bins(mzTree, mz_tolerance_ppm=5, min_timepoints=5, min_peak_height=1000)[source]
Bin an mzTree into a list of data bins, to feed to bin_to_mass_tracks. These data bins can form a single mass track, or span larger m/z region if the m/z values cannot be resolved into discrete tracks here.
- Parameters:
mzTree (dict[list[tuples]]) – indexed data points, {thousandth_mz: [(mz, ii, intensity_int)…], …}. (all data points indexed by m/z to thousandth precision, i.e. 0.001 amu).
mz_tolerance_ppm (float, optional, default: 5) – m/z tolerance in parts per million. Used to segregate m/z regions here.
min_timepoints (int, optional, default: 5) – minimal consecutive scans to be considered real signal.
min_peak_height (float, optional, default: 1000) – a bin is not considered if the max intensity < min_peak_height.
- Return type:
a list of flexible bins, [ [(mz, scan_num, intensity_int), …], … ]
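To illustrate the mzTree data structure, the sketch below indexes data points by m/z truncated to 0.001 amu and then joins consecutive keys into flexible bins. Joining only directly adjacent keys is an assumption, and the min_timepoints and min_peak_height filters are left out; build_mz_tree and group_adjacent_keys are hypothetical names.

```python
from collections import defaultdict


def build_mz_tree(data_points):
    """Index (mz, scan_num, intensity_int) points by int(mz * 1000)."""
    tree = defaultdict(list)
    for mz, scan_num, intensity in data_points:
        tree[int(mz * 1000)].append((mz, scan_num, intensity))
    return tree


def group_adjacent_keys(mz_tree):
    """Join data from consecutive thousandth-m/z keys into one bin."""
    bins = []
    previous_key = None
    for key in sorted(mz_tree):
        if previous_key is not None and key - previous_key == 1:
            bins[-1].extend(mz_tree[key])
        else:
            bins.append(list(mz_tree[key]))
        previous_key = key
    return bins
```

A bin produced this way may still span more than the ppm tolerance, which is why such bins are described as flexible and are resolved further downstream.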
- asari.chromatograms.merge_two_mass_tracks(T1, T2)[source]
Merge two mass tracks, each massTrack as ( mz, np.array(intensities at full rt range) ).
- Parameters:
T1 (list) – a massTrack represented as ( mz, np.array(intensities at full rt range))
T2 (list) – a massTrack represented as ( mz, np.array(intensities at full rt range))
- Return type:
The merged mass track. The mz value is averaged between tracks, intensities are summed pair-wise.
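The merge rule stated above (average the m/z, sum the intensities pair-wise) can be sketched with plain lists standing in for the numpy arrays. The name merge_tracks is hypothetical, and both tracks are assumed to have the same length.

```python
def merge_tracks(t1, t2):
    """Merge two (mz, intensities) tracks of equal length."""
    merged_mz = 0.5 * (t1[0] + t2[0])
    merged_intensities = [a + b for a, b in zip(t1[1], t2[1])]
    return merged_mz, merged_intensities
```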
- asari.chromatograms.remap_intensity_track(intensity_track, new, rt_cal_dict)[source]
Remap intensity_track based on rt_cal_dict, used by constructors.MassGrid.remap_intensity_track.
- Parameters:
intensity_track (list[int]) – list of intensity values from a mass track.
new (np.zeros) – new copy of np.zeros of RT length, possibly longer than intensity_track, because samples may have different RT lengths.
rt_cal_dict (dict) – sample specific mapping dictionary of RT. dictionary converting scan number in sample_rt_numbers to calibrated integer values.
- Return type:
Updated list of intensity, using coordinates in composite mass track.
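One plausible way to apply such a mapping is sketched below with plain lists. It assumes that the original values are copied first and that only the scan numbers present in rt_cal_dict are then written to their calibrated positions; this behaviour is an assumption for illustration, and the name remap_track is hypothetical.

```python
def remap_track(intensity_track, new, rt_cal_dict):
    """Write intensities into `new` at calibrated scan positions."""
    # Start from the uncalibrated values; `new` may be longer.
    new[:len(intensity_track)] = intensity_track
    for sample_scan, calibrated_scan in rt_cal_dict.items():
        new[calibrated_scan] = intensity_track[sample_scan]
    return new
```

Because only changed scan numbers are stored in the mapping, positions that are not mentioned in rt_cal_dict keep their original values in this sketch.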
- asari.chromatograms.rt_lowess_calibration(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers, num_iterations, sample_name, outdir)[source]
This is the alignment function of retention time between samples. Use LOWESS, Locally Weighted Scatterplot Smoothing, to create correspondence between sample_rt_numbers, reference_rt_numbers. Predicted numbers are skipped when outside real sample boundaries.
- Parameters:
good_landmark_peaks (list[peak]) – landmark peaks selected from this working sample. Landmark peaks are usually defined by 13C/12C patterns.
selected_reference_landmark_peaks (list[peak]) – landmark peaks selected from the reference sample, matched and equal-length to good_landmark_peaks.
sample_rt_numbers (list) – all scan numbers in this sample.
reference_rt_numbers (list) – all scan numbers in the ref sample.
sample_name (str) – sample’s name
- Returns:
rt_cal_dict (dict) – dictionary converting scan number in sample_rt_numbers to calibrated integer values. Range matched. Only changed numbers are kept for efficiency.
reverse_rt_cal_dict (dict) – from ref RT scan numbers to sample RT scan numbers. Range matched. Only changed numbers are kept for efficiency.
Note
LOWESS method available in statsmodels.nonparametric.smoothers_lowess, v 0.12, 0.13+ https://www.statsmodels.org/stable/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html But xvals have to be forced as float array until the fix is in new release. See __hacked_lowess__.
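The post-processing described for rt_cal_dict (skip predictions outside the boundaries, keep only changed numbers) can be sketched without the regression step. Here `predicted` stands in for the LOWESS output, the boundaries are taken from reference_rt_numbers, and the name build_rt_cal_dict is hypothetical; all three are assumptions made for illustration.

```python
def build_rt_cal_dict(sample_rt_numbers, predicted, reference_rt_numbers):
    """Map sample scan numbers to calibrated integer scan numbers."""
    low, high = min(reference_rt_numbers), max(reference_rt_numbers)
    rt_cal_dict = {}
    for scan, value in zip(sample_rt_numbers, predicted):
        calibrated = round(value)
        # Skip out-of-range predictions and unchanged scan numbers.
        if low <= calibrated <= high and calibrated != scan:
            rt_cal_dict[scan] = calibrated
    return rt_cal_dict
```

Keeping only the changed numbers makes the dictionary small when most scans already line up with the reference.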
- asari.chromatograms.rt_lowess_calibration_debug(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers, num_iterations, sample_name, outdir)[source]
This is the debug version of rt_lowess_calibration.
- Parameters:
good_landmark_peaks (list[peak]) – landmark peaks selected from this working sample. Landmark peaks are usually defined by 13C/12C patterns.
selected_reference_landmark_peaks (list[peak]) – landmark peaks selected from the reference sample, matched and equal-length to good_landmark_peaks.
sample_rt_numbers (list) – all scan numbers in this sample.
reference_rt_numbers (list) – all scan numbers in the ref sample.
sample_name (str) – sample’s name
- Returns:
rt_cal_dict (dict) – dictionary converting scan number in sample_rt_numbers to calibrated integer values. Range matched. Only changed numbers are kept for efficiency.
reverse_rt_cal_dict (dict) – from ref RT scan numbers to sample RT scan numbers. Range matched. Only changed numbers are kept for efficiency.
Outputs
A PDF file showing the LOWESS regression result.
Note
LOWESS method available in statsmodels.nonparametric.smoothers_lowess, v 0.12, 0.13+ https://www.statsmodels.org/stable/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html But xvals have to be forced as float array until the fix is in new release. See __hacked_lowess__.
- asari.chromatograms.savitzky_golay_spline(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers)[source]
Placeholder. Modified Savitzky-Golay filter followed by spline fitting; please follow the format in rt_lowess. Because our data are not equally spaced, the Savitzky-Golay method may produce undesired errors.
# UnivariateSpline can’t handle redundant values
spl = UnivariateSpline(xx, yy, )
sample.rt_calibration_function = spl
# rt_remap_dict will be used for index mapping to the reference sample
for ii in sample.rt_numbers:
    sample.rt_remap_dict[ii] = round(spl(ii), None)
- asari.chromatograms.smooth_lowess(list_intensity, frac=0.02)[source]
Smooth data of a very noisy mass track via LOWESS regression.
- Parameters:
list_intensity (list[int]) – list of intensity values from a mass track.
frac (float, optional, default: 0.02) – fraction of data used in LOWESS regression.
- Return type:
New list of smoothed intensity values.
Note
smooth_moving_average is preferred for most data. LOWESS is not good for small peaks.
- asari.chromatograms.smooth_moving_average(list_intensity, size=9)[source]
Smooth data of a noisy mass track using simple moving average.
- Parameters:
list_intensity (list[int]) – list of intensity values from a mass track.
size (int, optional, default: 9) – window size for moving average.
- Return type:
New list of smoothed intensity values.
Note
For very noisy data, one may use smooth_lowess.
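A simple moving average of the kind described above can be sketched in plain Python. The window is centered on each point and shrinks at the edges; that edge handling is an assumption and may differ from smooth_moving_average, and the name moving_average is hypothetical.

```python
def moving_average(list_intensity, size=9):
    """Centered moving average; the window is truncated at both ends."""
    half = size // 2
    n = len(list_intensity)
    smoothed = []
    for i in range(n):
        window = list_intensity[max(0, i - half): min(n, i + half + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A larger window suppresses more noise but also flattens narrow peaks, so the window size is a trade-off between the two.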