4.4. Chromatograms

Functions related to chromatograms and mass tracks. RT scan numbers and intensities are handled as integers. Binning is flexible and based on ppm accuracy: XIC construction uses an m/z tolerance (default 5 ppm). XICs without neighbors within x ppm are considered specific (i.e. high selectivity). Low-selectivity regions are still inspected to determine the true number of XICs.

asari.chromatograms.bin_to_mass_tracks(bin_data_tuples, rt_length, mz_tolerance_ppm=5)[source]

Construct mass tracks from data that may exceed proper m/z range (i.e. over mz_tolerance_ppm). A mass track is an EIC for full RT range, without separating the mass traces. Imperfect ROIs will be examined in extract_massTracks_ and merged if needed.

Parameters:
  • bin_data_tuples (list[tuple]) – a flexible bin by units of 0.001 amu, in format of [(mz, scan_num, intensity_int), …]. This may or may not be within mz_tolerance_ppm.

  • rt_length (int) – full number of scans.

  • mz_tolerance_ppm (float, optional, default: 5) – m/z tolerance in parts per million.

Return type:

a list of massTracks as [(mz, np.array(intensities at full rt range)), …].

asari.chromatograms.build_chromatogram_by_mz_clustering(bin_data_tuples, rt_length, mz_tolerance)[source]

Generates a list of prototype extracted ion chromatograms to be used by extract_single_track_fullrt_length.

Parameters:
  • bin_data_tuples (list[tuple]) – a flexible bin in format of [(mz, scan_num, intensity_int), …].

  • rt_length (int) – number of scans (NOT USED, remove from call signature in future)

  • mz_tolerance (float) – precomputed based on m/z and ppm, e.g. 5 ppm of 80 = 0.0004; 5 ppm of 800 = 0.0040.

Return type:

A list of clusters as separated bins, each bin as [(mz, scan_num, intensity_int), …]
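The precomputed mz_tolerance values quoted above follow directly from the definition of ppm. A minimal sketch of that arithmetic, where ppm_to_mz_tolerance is a hypothetical helper for illustration and not part of asari:

```python
def ppm_to_mz_tolerance(mz, ppm=5):
    """Absolute m/z tolerance that corresponds to `ppm` at a given m/z.

    Hypothetical helper, shown only to illustrate the arithmetic.
    """
    return mz * ppm * 1e-6


tol_80 = ppm_to_mz_tolerance(80)    # 5 ppm at m/z 80
tol_800 = ppm_to_mz_tolerance(800)  # 5 ppm at m/z 800
print(round(tol_80, 6), round(tol_800, 6))
```

Because the tolerance scales with m/z, a single absolute value is only appropriate for a bin that spans a narrow m/z range.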

asari.chromatograms.build_chromatogram_intensity_aware(bin_data_tuples, rt_length, mz_tolerance)[source]

Chromatogram builder as in ADAP, for testing. Starts with the highest intensity, goes down by intensity and includes data points within mz_tolerance. Repeats until no track is detected. Continuous RT is not required here; that is handled in extract_single_track_fullrt_length. The default in asari is build_chromatogram_by_mz_clustering.

Parameters:
  • bin_data_tuples (list[tuple]) – a flexible bin in format of [(mz, scan_num, intensity_int), …].

  • rt_length (int) – number of scans (NOT USED, remove from call signature in future)

  • mz_tolerance_ppm (float) – m/z tolerance in parts per million. Used to segregate m/z regions here.

Returns:

assigned – separated bins of [(mz, scan_num, intensity_int), …], prototypes of extracted ion chromatograms to be used by extract_single_track_fullrt_length.
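The intensity-ordered strategy described for this function can be sketched roughly as follows. This is a simplified illustration and not asari's implementation: build_bins_intensity_aware is a hypothetical name, mz_tolerance is treated as an absolute m/z window around the seed, and the loop stops when no data points remain rather than applying any track-detection criterion.

```python
def build_bins_intensity_aware(bin_data_tuples, mz_tolerance):
    """Greedy sketch: seed on the most intense remaining point, collect every
    remaining point within mz_tolerance of the seed m/z, then repeat."""
    # Sort once so the first remaining point is always the most intense.
    remaining = sorted(bin_data_tuples, key=lambda p: p[2], reverse=True)
    bins = []
    while remaining:
        seed_mz = remaining[0][0]
        members = [p for p in remaining if abs(p[0] - seed_mz) <= mz_tolerance]
        remaining = [p for p in remaining if abs(p[0] - seed_mz) > mz_tolerance]
        bins.append(members)
    return bins


points = [
    (100.0000, 1, 500), (100.0001, 2, 9000), (100.0002, 3, 700),
    (100.0030, 1, 300), (100.0031, 2, 4000),
]
bins = build_bins_intensity_aware(points, mz_tolerance=0.0005)
for b in bins:
    print(b)
```

RT continuity is deliberately ignored in this sketch, in line with the description above.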

asari.chromatograms.clean_rt_calibration_points(rt_cal_pairs)[source]

Remove redundant RT calibration data points and outliers (beyond 3 standard deviations). This does not force an even distribution of calibration data points.

Parameters:

rt_cal_pairs (list) – paired scan numbers from this sample and from the reference sample.

Returns:

rt_cal_pairs – clean and sorted version of rt_cal_pairs.
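One plausible reading of this cleaning step is sketched below. The documentation does not state which quantity the 3x stdev cutoff is applied to, so the sketch assumes it is the shift between sample and reference scan numbers; clean_pairs_sketch is a hypothetical name and not the asari function.

```python
import statistics


def clean_pairs_sketch(rt_cal_pairs, n_stdev=3):
    """Deduplicate, sort, and drop pairs whose shift is an outlier."""
    # Drop exact duplicates and sort by sample scan number.
    pairs = sorted(set(rt_cal_pairs))
    # Assumption: outliers are judged on the shift (sample - reference).
    shifts = [s - r for s, r in pairs]
    mean = statistics.mean(shifts)
    sd = statistics.pstdev(shifts)
    if sd == 0:
        return pairs
    return [p for p, d in zip(pairs, shifts) if abs(d - mean) <= n_stdev * sd]


# Twenty consistent pairs, one duplicate, and one clearly deviating pair.
raw = [(i, i + 1) for i in range(0, 200, 10)] + [(10, 11), (205, 105)]
cleaned = clean_pairs_sketch(raw)
print(len(raw), len(cleaned))
```

As in the description above, nothing here enforces an even spread of the retained points along the RT axis.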

asari.chromatograms.dwt_rt_calibrate(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers)[source]

Placeholder. Not implemented.

asari.chromatograms.extract_massTracks_(ms_expt, mz_tolerance_ppm=5, min_intensity=100, min_timepoints=5, min_peak_height=1000)[source]

Extract mass tracks from an object of parsed LC-MS data file. A mass track is an EIC for full RT range, without separating the mass traces of same m/z.

Parameters:
  • ms_expt (pymzml.run.Reader(f)) – instance of pymzml.run.Reader(f), a parsed object of LC-MS data file

  • mz_tolerance_ppm (float, optional, default: 5) – m/z tolerance in parts per million. Used to segregate m/z regions here.

  • min_intensity (float, optional, default: 100) – minimal intensity value, needed because some instruments keep 0s.

  • min_timepoints (int, optional, default: 5) – minimal consecutive scans to be considered real signal.

  • min_peak_height (float, optional, default: 1000) – a bin is not considered if the max intensity < min_peak_height.

Returns:

Dict with keys ‘rt_numbers’, ‘rt_times’, and ‘tracks’, each with a value that is a list of co-indexed rt_numbers, rt_times, and mass tracks respectively. Mass tracks are represented as [( mz, np.array(intensities at full rt range) ), …]

asari.chromatograms.extract_single_track_fullrt_length(bin, rt_length, INTENSITY_DATA_TYPE=<class 'numpy.int64'>)[source]

Build a mass track from a bin of data points already in limited m/z range. A mass track is an EIC for full RT range, without separating the mass traces. Consensus m/z is taken as the mean of median m/z and the m/z of highest intensity, to be more robust. When multiple data points exist in the same scan (same RT), max intensity is used.

Parameters:
  • bin (list[tuple]) – data points, in format of [(mz_int, scan_num, intensity_int), …].

  • rt_length (int) – full number of scans.

  • INTENSITY_DATA_TYPE (any int numpy.dtype, optional, default: INTENSITY_DATA_TYPE (currently np.int64)) – defaults to np.int64 to be future-safe, although int32 may be adequate and more efficient.

Return type:

a massTrack as ( mz, np.array(intensities at full rt range) ).
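The consensus m/z rule and the per-scan maximum described above can be illustrated with a small sketch. It is not the asari implementation: single_track_sketch is a hypothetical name, and scan_num is assumed to be a 0-based index into the full-length track.

```python
import numpy as np


def single_track_sketch(data_points, rt_length, dtype=np.int64):
    """Build (consensus_mz, intensity array over the full RT range)."""
    mzs = np.array([p[0] for p in data_points])
    intensities = np.array([p[2] for p in data_points])
    # Consensus m/z: mean of the median m/z and the m/z at highest intensity.
    consensus_mz = 0.5 * (np.median(mzs) + mzs[np.argmax(intensities)])
    track = np.zeros(rt_length, dtype=dtype)
    for _, scan_num, intensity in data_points:
        # Keep the max intensity when a scan has multiple data points.
        track[scan_num] = max(track[scan_num], intensity)
    return float(consensus_mz), track


points = [
    (200.0010, 1, 500),
    (200.0012, 2, 8000),
    (200.0014, 2, 6000),  # same scan as the previous point
    (200.0016, 4, 300),
]
mz, track = single_track_sketch(points, rt_length=6)
print(mz, track.tolist())
```

Scans with no data points stay at zero, which is what makes the result a track over the full RT range rather than a list of separate traces.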

asari.chromatograms.get_thousandth_bins(mzTree, mz_tolerance_ppm=5, min_timepoints=5, min_peak_height=1000)[source]

Bin an mzTree into a list of data bins, to feed to bin_to_mass_tracks. These data bins can form a single mass track, or span larger m/z region if the m/z values cannot be resolved into discrete tracks here.

Parameters:
  • mzTree (dict[list[tuples]]) – indexed data points, {thousandth_mz: [(mz, ii, intensity_int)…], …}. (all data points indexed by m/z to thousandth precision, i.e. 0.001 amu).

  • mz_tolerance_ppm (float, optional, default: 5) – m/z tolerance in parts per million. Used to segregate m/z regions here.

  • min_timepoints (int, optional, default: 5) – minimal consecutive scans to be considered real signal.

  • min_peak_height (float, optional, default: 1000) – a bin is not considered if the max intensity < min_peak_height.

Return type:

a list of flexible bins, [ [(mz, scan_num, intensity_int), …], … ]
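To make the mzTree layout concrete, the sketch below indexes data points by m/z at 0.001 amu precision and then joins consecutive keys into flexible bins. Both function names are hypothetical, the integer key is an assumption about how thousandth_mz is encoded, and the filtering by mz_tolerance_ppm, min_timepoints and min_peak_height is not modeled.

```python
from collections import defaultdict


def index_by_thousandth(data_points):
    """Index (mz, scan_num, intensity) tuples by m/z at 0.001 amu precision."""
    tree = defaultdict(list)
    for mz, scan_num, intensity in data_points:
        tree[int(mz * 1000)].append((mz, scan_num, intensity))
    return dict(tree)


def group_adjacent_keys(tree):
    """Concatenate data from consecutive 0.001 amu keys into one flexible bin."""
    bins, current, previous = [], [], None
    for key in sorted(tree):
        if previous is not None and key - previous > 1:
            bins.append(current)
            current = []
        current.extend(tree[key])
        previous = key
    if current:
        bins.append(current)
    return bins


points = [(150.0004, 0, 1200), (150.0011, 1, 1500), (150.2503, 0, 2000)]
tree = index_by_thousandth(points)
flexible_bins = group_adjacent_keys(tree)
print(sorted(tree), [len(b) for b in flexible_bins])
```

A bin built this way may still span more than the m/z tolerance, which is why such bins are passed on for further resolution.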

asari.chromatograms.merge_two_mass_tracks(T1, T2)[source]

Merge two mass tracks, each massTrack as ( mz, np.array(intensities at full rt range) ).

Parameters:
  • T1 (list) – a massTrack represented as ( mz, np.array(intensities at full rt range))

  • T2 (list) – a massTrack represented as ( mz, np.array(intensities at full rt range))

Return type:

The merged mass track. The mz value is averaged between tracks, intensities are summed pair-wise.
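The merge rule stated above (average the m/z, sum the intensities element-wise) amounts to very little code. A minimal sketch, assuming both tracks have the same RT length; merge_tracks_sketch is a hypothetical stand-in for the documented function:

```python
import numpy as np


def merge_tracks_sketch(T1, T2):
    """Average the two m/z values and sum the intensity arrays pair-wise."""
    mz = 0.5 * (T1[0] + T2[0])
    return (mz, T1[1] + T2[1])


T1 = (300.1000, np.array([0, 10, 20, 0]))
T2 = (300.1004, np.array([5, 0, 30, 0]))
merged = merge_tracks_sketch(T1, T2)
print(merged[0], merged[1].tolist())
```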

asari.chromatograms.remap_intensity_track(intensity_track, new, rt_cal_dict)[source]

Remap intensity_track based on rt_cal_dict, used by constructors.MassGrid.remap_intensity_track.

Parameters:
  • intensity_track (list[int]) – list of intensity values from a mass track.

  • new (np.zeros) – new copy of np.zeros of RT length, possibly longer than intensity_track, because samples may have different RT lengths.

  • rt_cal_dict (dict) – sample specific mapping dictionary of RT. dictionary converting scan number in sample_rt_numbers to calibrated integer values.

Return type:

Updated list of intensity, using coordinates in composite mass track.
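A rough sketch of what such a remapping could look like is given below. It is an illustration under stated assumptions rather than asari's code: remap_sketch is a hypothetical name, scans missing from rt_cal_dict are assumed to keep their original index (the dictionary is described elsewhere in this section as keeping only changed numbers), and when two scans land on the same calibrated index the larger intensity is kept.

```python
import numpy as np


def remap_sketch(intensity_track, new, rt_cal_dict):
    """Write each intensity into `new` at its calibrated scan index."""
    for scan_num, intensity in enumerate(intensity_track):
        # Assumption: unchanged scans are absent from rt_cal_dict.
        target = rt_cal_dict.get(scan_num, scan_num)
        # Assumption: on collisions, keep the larger intensity.
        new[target] = max(new[target], intensity)
    return new


track = np.array([0, 100, 900, 100, 0])
rt_cal_dict = {2: 3, 3: 4}  # sample scan number -> calibrated scan number
remapped = remap_sketch(track, np.zeros(7, dtype=np.int64), rt_cal_dict)
print(remapped.tolist())
```

The output array is longer than the input here, reflecting the case where the composite RT range exceeds that of the individual sample.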

asari.chromatograms.rt_lowess_calibration(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers, num_iterations, sample_name, outdir)[source]

This is the alignment function of retention time between samples. Use LOWESS, Locally Weighted Scatterplot Smoothing, to create correspondence between sample_rt_numbers, reference_rt_numbers. Predicted numbers are skipped when outside real sample boundaries.

Parameters:
  • good_landmark_peaks (list[peak]) – landmark peaks selected from this working sample. Landmark peaks are usually defined by 13C/12C patterns.

  • selected_reference_landmark_peaks (list[peak]) – landmark peaks selected from the reference sample, matched and equal-length to good_landmark_peaks.

  • sample_rt_numbers (list) – all scan numbers in this sample.

  • reference_rt_numbers (list) – all scan numbers in the ref sample.

  • sample_name (str) – sample’s name

Returns:

  • rt_cal_dict (dict) – dictionary converting scan number in sample_rt_numbers to calibrated integer values. Range matched. Only changed numbers are kept for efficiency.

  • reverse_rt_cal_dict (dict) – from ref RT scan numbers to sample RT scan numbers. Range matched. Only changed numbers are kept for efficiency.

Note

The LOWESS method is available in statsmodels.nonparametric.smoothers_lowess (v 0.12, 0.13+): https://www.statsmodels.org/stable/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html However, xvals have to be forced to a float array until the fix is in a new release. See __hacked_lowess__.

asari.chromatograms.rt_lowess_calibration_debug(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers, num_iterations, sample_name, outdir)[source]

This is the debug version of rt_lowess_calibration.

Parameters:
  • good_landmark_peaks (list[peak]) – landmark peaks selected from this working sample. Landmark peaks are usually defined by 13C/12C patterns.

  • selected_reference_landmark_peaks (list[peak]) – landmark peaks selected from the reference sample, matched and equal-length to good_landmark_peaks.

  • sample_rt_numbers (list) – all scan numbers in this sample.

  • reference_rt_numbers (list) – all scan numbers in the ref sample.

  • sample_name (str) – sample’s name

Returns:

  • rt_cal_dict (dict) – dictionary converting scan number in sample_rt_numbers to calibrated integer values. Range matched. Only changed numbers are kept for efficiency.

  • reverse_rt_cal_dict (dict) – from ref RT scan numbers to sample RT scan numbers. Range matched. Only changed numbers are kept for efficiency.

Outputs

A PDF file showing the LOWESS regression result.

Note

The LOWESS method is available in statsmodels.nonparametric.smoothers_lowess (v 0.12, 0.13+): https://www.statsmodels.org/stable/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html However, xvals have to be forced to a float array until the fix is in a new release. See __hacked_lowess__.

asari.chromatograms.savitzky_golay_spline(good_landmark_peaks, selected_reference_landmark_peaks, sample_rt_numbers, reference_rt_numbers)[source]

Placeholder. Modified Savitzky-Golay filter followed by spline fitting; please follow the format in rt_lowess. Because our data are not equally spaced, the Savitzky-Golay method may produce undesired errors. Note that UnivariateSpline can’t handle redundant values. The docstring keeps the following commented-out sketch of the intended approach: fit spl = UnivariateSpline(xx, yy, ), store it as sample.rt_calibration_function = spl, then fill rt_remap_dict, which will be used for index mapping to the reference sample, by setting sample.rt_remap_dict[ii] = round(spl(ii), None) for each ii in sample.rt_numbers.

asari.chromatograms.smooth_lowess(list_intensity, frac=0.02)[source]

Smooth data of a very noisy mass track via LOWESS regression.

Parameters:
  • list_intensity (list[int]) – list of intensity values from a mass track.

  • frac (float, optional, default: 0.02) – fraction of data used in LOWESS regression.

Return type:

New list of smoothed intensity values.

Note

smooth_moving_average is preferred for most data. LOWESS is not good for small peaks.

asari.chromatograms.smooth_moving_average(list_intensity, size=9)[source]

Smooth data of a noisy mass track using simple moving average.

Parameters:
  • list_intensity (list[int]) – list of intensity values from a mass track.

  • size (int, optional, default: 9) – window size for moving average.

Return type:

New list of smoothed intensity values.

Note

For very noisy data, one may use smooth_lowess.
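A simple moving average of the kind described for smooth_moving_average can be sketched with np.convolve. This is one common way to compute it and is not necessarily how asari does; moving_average_sketch is a hypothetical name, and the handling of the track edges (implicit zero padding in "same" mode) is an assumption.

```python
import numpy as np


def moving_average_sketch(list_intensity, size=9):
    """Centered moving average with a flat window of `size` points."""
    kernel = np.ones(size) / size
    # mode="same" keeps the output the same length as the input.
    return np.convolve(list_intensity, kernel, mode="same")


# A single spike of 900 in the middle of an otherwise empty track.
signal = [0] * 10 + [900] + [0] * 10
smoothed = moving_average_sketch(signal, size=9)
print(np.round(smoothed, 1).tolist())
```

The spike is spread evenly over the nine-point window while the summed intensity is preserved. The result is a float array, so it would need rounding and casting wherever integer intensities are expected.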