4.5. Constructors

The MassGrid and CompositeMap classes.

class asari.constructors.CompositeMap(experiment)[source]

Each experiment is summarized into a CompositeMap (CMAP), as a master feature map. The use of CompositeMap also facilitates data visualization and exploration. Related concepts:

  1. MassGrid: a matrix for recording correspondence of mass tracks to each sample

  2. FeatureList: list of feature definitions, i.e. elution peaks defined on composite mass tracks.

  3. FeatureTable: a matrix for feature intensities per sample.
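
To make the three related concepts concrete, here is a minimal sketch using plain Python containers. The sample names, the values and field names such as parent_masstrack_id are illustrative assumptions; asari stores MassGrid and FeatureTable as DataFrames, which this toy version does not reproduce.

```python
# Toy illustration only: plain dicts stand in for the DataFrames used by asari.

# MassGrid: one row per aligned m/z value, one column per sample.
# Each cell holds the id of the matching mass track in that sample (None if absent).
mass_grid = {
    "sample_A": [0, 1, 2],
    "sample_B": [0, None, 1],
}

# FeatureList: elution peaks defined on composite mass tracks (hypothetical fields).
feature_list = [
    {"id": "F1", "parent_masstrack_id": 0, "left_base": 10, "right_base": 20},
    {"id": "F2", "parent_masstrack_id": 2, "left_base": 35, "right_base": 48},
]

# FeatureTable: one row per feature, one column per sample, values are intensities.
feature_table = {
    "sample_A": [152000, 98000],
    "sample_B": [143000, 0],  # 0 means no real peak for F2 in sample_B
}


def samples_with_track(grid, row):
    """Return the names of samples that have a mass track in the given grid row."""
    return [name for name, column in grid.items() if column[row] is not None]


print(samples_with_track(mass_grid, 1))
```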

build_composite_tracks()[source]

Perform RT calibration then make composite tracks.

Updates:
  • self.good_reference_landmark_peaks – [{‘ref_id_num’: 99, ‘apex’: 211, ‘height’: 999999}, …]

  • self.composite_mass_tracks – list of composite mass tracks in this experiment.

  • sample.rt_cal_dict and sample.reverse_rt_cal_dict for all samples.

Note

See calibrate_sample_RT for details in RT alignment.

calibrate_sample_RT(sample, list_mass_tracks, calibration_fuction=<function rt_lowess_calibration>, cal_min_peak_height=100000, MIN_PEAK_NUM=15, MAX_RETENTION_SHIFT=inf, NUM_ITERATIONS=3)[source]

Calibrate/align retention time per sample.

Parameters:
  • sample (SimpleSample instance) – instance of SimpleSample class

  • list_mass_tracks (list) – list of mass tracks in sample. This may not be kept in memory with the sample instance and may therefore need to be retrieved.

  • calibration_fuction (function, optional, default: rt_lowess_calibration) – RT calibration function to use.

  • cal_min_peak_height (float, optional, default: 100000) – minimal height required for a peak to be used for calibration. Only high-quality peaks unique in each mass track are used for calibration.

  • MIN_PEAK_NUM (int, optional, default: 15) – minimal number of peaks required for calibration. Abort if not met.

Updates:
  • sample.rt_cal_dict – dictionary converting scan number in sample_rt_numbers to calibrated integer values in self.reference_sample. Range matched. Only changed numbers are kept for efficiency.

  • sample.reverse_rt_cal_dict – dictionary from ref RT scan numbers to sample RT scan numbers. Range matched. Only changed numbers are kept for efficiency.

  • sample.rt_landmarks – list of apex scan numbers for the peaks used in RT calibration.

Note

This is based on a set of unambiguous peaks: quick peak detection is run on anchor mass tracks, and peaks that are unique to each track are used for RT alignment. Only scan numbers that differ between the two samples are kept in the dictionaries, for computing efficiency. When calibration_fuction fails, e.g. inf on lowess_predicted, it is assumed that this sample is not amenable to computational alignment, and the sample will be attached later without adjusting retention time. In the future, it would be good to have good_landmark_peaks cover the RT range evenly. Using user-supplied internal standards will be an important option.
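
The idea of keeping only changed scan numbers can be sketched as follows. This is not the asari implementation: piecewise-linear interpolation is swapped in for the LOWESS fit, and the function names and the (sample_apex, reference_apex) input layout are assumptions made for illustration.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear mapping; outside the landmark range, reuse the nearest offset."""
    if x <= xs[0]:
        return x + (ys[0] - xs[0])
    if x >= xs[-1]:
        return x + (ys[-1] - xs[-1])
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def build_rt_cal_dicts(landmark_pairs, num_scans, min_peak_num=15):
    """landmark_pairs: list of (sample_apex_scan, reference_apex_scan) tuples."""
    # One reference apex per sample apex, sorted by sample scan number.
    pairs = sorted(dict(landmark_pairs).items())
    if len(pairs) < min_peak_num:
        return {}, {}  # too few landmarks: retention time is left unadjusted
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    rt_cal, reverse_rt_cal = {}, {}
    for scan in range(num_scans):
        calibrated = round(interpolate(scan, xs, ys))
        if calibrated != scan:  # only changed numbers are stored
            rt_cal[scan] = calibrated
            reverse_rt_cal[calibrated] = scan
    return rt_cal, reverse_rt_cal


rt_cal, reverse_rt_cal = build_rt_cal_dicts(
    [(10, 10), (20, 22), (30, 32)], num_scans=40, min_peak_num=3
)
print(rt_cal.get(20), 5 in rt_cal)
```

A lookup such as rt_cal.get(scan, scan) then returns the calibrated scan number, falling back to the original number for scans that were not shifted.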

calibrate_sample_RT_by_standards(sample)[source]

Placeholder, to add RT calibration based on spike-in compound standards.

Parameters:

sample – a SimpleSample object for the sample containing the spike-in standards.

construct_mass_grid()[source]

Construct the MassGrid for the whole experiment. If the sample number is no more than a predefined parameter (‘project_sample_number_small’, default 10), this is considered a small study and a pairwise alignment is performed. See MassGrid.build_grid_sample_wise, MassGrid.add_sample. Otherwise, for a larger study, the mass alignment is performed by the same NN clustering method that is used in initial mass track construction. See MassGrid.build_grid_by_centroiding, MassGrid.bin_track_mzs.

Updates:
  • self._mz_landmarks_ – landmark m/z values that match to 13C/12C and Na/H patterns

  • self.MassGrid – DataFrame with reference sample as first entry. Use sample name as column identifiers.

Note

The number of samples dictates the workflow: build_grid_by_centroiding is fast, but build_grid_sample_wise is used for small studies to compensate for the limited sample size available for estimating the statistical distribution. All mass tracks are included at this stage, regardless of whether peaks are detected, because peak detection will be an improved process on the composite tracks.
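
The branching rule described above can be sketched in a few lines. The helper name choose_grid_method is hypothetical and returns the method name as a string only for illustration; the threshold default mirrors ‘project_sample_number_small’.

```python
def choose_grid_method(number_of_samples, project_sample_number_small=10):
    """Return the name of the MassGrid method that would be used for this study size."""
    # "No more than" the threshold counts as a small study.
    if number_of_samples <= project_sample_number_small:
        return "build_grid_sample_wise"
    return "build_grid_by_centroiding"


for n in (6, 10, 11, 300):
    print(n, choose_grid_method(n))
```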

export_reference_sample()[source]

Write the m/z and retention time of “good” ions in the reference sample to a CSV file.

Results

    mz,rtime
    84.04437446594238,196.3507106869998
    85.04770363867283,197.100775215
    90.05493021011353,160.75314731200018
    100.11204060912132,18.757312656
    101.11540949344635,19.138889808
    104.9922667145729,147.4066373920002
    105.99559181928635,147.7856911519998
    112.09949165582657,255.0619356640002
    114.06613251566887,74.11716273600001
    ……

The file is written under the export directory; its name is the reference sample name followed by _mz_rtime_landmarks.
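
A file in this layout could be produced with the standard csv module, as in the sketch below. The function name write_landmarks and the ".csv" suffix are assumptions; the source states only the "_mz_rtime_landmarks" part of the file name.

```python
import csv
import os
import tempfile


def write_landmarks(landmarks, export_dir, reference_sample_name):
    """Write (mz, rtime) pairs to a CSV file and return its path.

    The '.csv' suffix is an assumption, not taken from the asari source.
    """
    path = os.path.join(export_dir, reference_sample_name + "_mz_rtime_landmarks.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["mz", "rtime"])  # header row
        writer.writerows(landmarks)
    return path


with tempfile.TemporaryDirectory() as export_dir:
    out = write_landmarks(
        [(84.04437446594238, 196.3507106869998), (85.04770363867283, 197.100775215)],
        export_dir,
        "reference_sample",
    )
    with open(out) as handle:
        print(handle.read())
```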

extract_features_per_sample(sample, peak_area_function)[source]

Extract and return peak area values in a sample, based on the start and end positions defined in self.FeatureList. A peak area could be 0 if no real peak is present for a feature in this sample.

Parameters:
  • sample (SimpleSample instance) – instance of SimpleSample class.

  • peak_area_function (function) – function to be used for peak area calculation

Return type:

A list of peak area values, for all features in a sample.
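
As a rough sketch of this per-sample extraction, the loop below applies a peak area function to every feature and records 0 when the sample has no matching mass track. The data layout (a dict of track id to intensity list, and the parent_masstrack_id field) is assumed for illustration, as is the inclusive treatment of right_base.

```python
def area_by_sum(track_intensity, left_base, right_base):
    """Sum intensities within the peak boundaries (right_base assumed inclusive)."""
    return sum(track_intensity[left_base:right_base + 1])


def extract_features(sample_tracks, feature_list, peak_area_function):
    """Return one peak area per feature for a single sample.

    sample_tracks: dict mapping composite track id -> list of intensities
    (a hypothetical layout chosen for this sketch).
    """
    areas = []
    for feature in feature_list:
        track = sample_tracks.get(feature["parent_masstrack_id"])
        if track is None:
            areas.append(0)  # no mass track in this sample, so no peak
        else:
            areas.append(
                peak_area_function(track, feature["left_base"], feature["right_base"])
            )
    return areas


tracks = {0: [0, 0, 5, 9, 5, 0, 0]}
features = [
    {"parent_masstrack_id": 0, "left_base": 1, "right_base": 5},
    {"parent_masstrack_id": 7, "left_base": 2, "right_base": 4},
]
print(extract_features(tracks, features, area_by_sum))
```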

generate_feature_table()[source]

Initialize and populate self.FeatureTable, a dataframe with one column per sample.

get_peak_area_auc(track_intensity, left_base, right_base)[source]

Option to calculate peak area as area under the curve. This is approximated by a maximum filter to cover potential gaps.

Parameters:
  • track_intensity (np.array[dtype=INTENSITY_DATA_TYPE]) – np.array, i.e. mass_track[‘intensity’]

  • left_base (int) – index for peak left base

  • right_base (int) – index for peak right base

Return type:

Integer of peak area value
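
The effect of a maximum filter on gaps can be illustrated with a small pure-Python sketch. The window of two points (each point and its left neighbour) and the zero padding at the left edge are assumptions of this example, not settings confirmed by the source.

```python
def peak_area_auc_sketch(track_intensity, left_base, right_base):
    """Approximate area under the curve after a 2-point maximum filter.

    Each point is replaced by the larger of itself and its left neighbour,
    so that single-scan gaps (zeros) inside a peak are covered.
    """
    segment = list(track_intensity[left_base:right_base + 1])
    previous = [0] + segment[:-1]  # left neighbours, zero-padded at the edge
    filtered = [max(p, c) for p, c in zip(previous, segment)]
    return int(sum(filtered))


track = [0, 5, 0, 7, 0]
print(peak_area_auc_sketch(track, 1, 3), sum(track[1:4]))
```

In this toy track the gap between the two non-zero points is filled, so the filtered area is larger than the plain sum over the same boundaries.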

get_peak_area_gaussian(track_intensity, left_base, right_base)[source]

Option to calculate peak area by fitting the data to a Gaussian model.

Parameters:
  • track_intensity (np.array[dtype=INTENSITY_DATA_TYPE]) – np.array, i.e. mass_track[‘intensity’]

  • left_base (int) – index for peak left base

  • right_base (int) – index for peak right base

Return type:

Peak area as an integer value, computed as the Gaussian integral.
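
For a Gaussian peak with amplitude a and standard deviation sigma, the integral is a * sigma * sqrt(2 * pi). The sketch below estimates a and sigma from the data by simple moments rather than by a curve-fitting routine; that substitution, the function names and the inclusive right_base are assumptions of this example.

```python
import math


def gaussian(x, a, mu, sigma):
    """Gaussian curve with amplitude a, centre mu and standard deviation sigma."""
    return a * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def peak_area_gaussian_moments(track_intensity, left_base, right_base):
    """Estimate the Gaussian integral a * sigma * sqrt(2*pi) from data moments."""
    segment = list(track_intensity[left_base:right_base + 1])
    total = sum(segment)
    if total <= 0:
        return 0
    xs = range(len(segment))
    # Intensity-weighted mean and variance of the scan index.
    mu = sum(x * y for x, y in zip(xs, segment)) / total
    variance = sum(y * (x - mu) ** 2 for x, y in zip(xs, segment)) / total
    a = max(segment)  # amplitude taken as the apex height
    return int(a * math.sqrt(variance) * math.sqrt(2 * math.pi))


# Synthetic peak: amplitude 1000, centre at scan 10, sigma of 2 scans.
track = [gaussian(x, 1000, 10, 2) for x in range(21)]
print(peak_area_gaussian_moments(track, 0, 20))
```

The moment estimate is only reasonable for clean, roughly symmetric peaks; a least-squares fit would be more robust to noise and truncated boundaries.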

get_peak_area_sum(track_intensity, left_base, right_base)[source]

Option to calculate peak area by sum of the intensity values on the track within the peak boundaries.

Parameters:
  • track_intensity (np.array[dtype=INTENSITY_DATA_TYPE]) – np.array, i.e. mass_track[‘intensity’]

  • left_base (int) – index for peak left base

  • right_base (int) – index for peak right base

Return type:

Integer of peak area value

get_reference_rtimes(rt_length)[source]

Extrapolate retention time on self.reference_sample_instance to max scan number in the experiment. This will be used to calculate retention time in the end, as intermediary steps use scan numbers.

Parameters:

rt_length (int) – this represents the total number of scans

Return type:

dictionary of scan number to retention time in the reference_sample.
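
One simple way to extrapolate is to extend the reference retention times by their mean scan interval, as sketched below. The function name and the choice of the mean interval are assumptions; the source does not state how the extrapolation is computed.

```python
def extend_rtimes(reference_rtimes, rt_length):
    """Map scan number -> retention time for rt_length scans.

    Scans beyond the end of the reference run are extrapolated with the
    mean scan interval (an assumption of this sketch). Requires at least
    two reference retention times.
    """
    n = len(reference_rtimes)
    rt_dict = dict(enumerate(reference_rtimes[:rt_length]))
    if rt_length > n:
        step = (reference_rtimes[-1] - reference_rtimes[0]) / (n - 1)
        for scan in range(n, rt_length):
            rt_dict[scan] = reference_rtimes[-1] + step * (scan - n + 1)
    return rt_dict


print(extend_rtimes([0.0, 0.5, 1.0], 5))
```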

get_reference_sample_instance(reference_sample_id)[source]

Wraps the reference_sample into a SimpleSample instance, so that it has the same behaviors as other samples.

Parameters:

reference_sample_id (any valid sample_id) – this is used to retrieve the sample from the experiment’s sample_registry

Return type:

instance of SimpleSample class for the reference_sample.

global_peak_detection()[source]

Detects elution peaks on composite mass tracks, resulting in a list of features. Uses peaks.batch_deep_detect_elution_peaks for parallel processing.

Updates:
  • self.FeatureList – a list of JSON peaks

  • self.FeatureTable – a pandas dataframe for features across all samples.

Note

Because the composite mass tracks are summarized over all samples, the resulting elution peaks are really features at the experiment level. Peak area and height are accumulated from all samples rather than averaged, because some peaks are present in only a few samples.

mock_rentention_alignment()[source]

Create empty mapping dictionaries if the RT alignment fails, e.g. for blank or exogenous samples.

set_RT_reference(cal_peak_intensity_threshold=100000)[source]

Start with the reference sample, usually set to the sample with the most landmark mass tracks. Do a quick peak detection for good peaks; use high selectivity m/z to avoid ambiguity in peak definitions.

Parameters:

cal_peak_intensity_threshold (float, optional, default: 100000) – a peak must have an intensity above this value to be used as an RT_reference

Returns:

good_reference_landmark_peaks

Return type:

[{‘ref_id_num’: 99, ‘apex’: 211, ‘height’: 999999}, …]

Note

Some members in good_reference_landmark_peaks may have the same RT apex. But the redundant numbers should be handled by rt_lowess_calibration, in which .frac is more important for stability.
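
The selection of landmark peaks can be sketched as a filter on peak height. The input layout and the rule "exactly one peak above the threshold per track" are assumptions used here to illustrate unambiguous peaks; the output mirrors the documented dictionary keys.

```python
def select_landmark_peaks(peaks_by_track, cal_peak_intensity_threshold=100000):
    """Keep one tall, unambiguous peak per mass track.

    peaks_by_track: dict mapping track id -> list of {'apex', 'height'} dicts
    (a hypothetical layout for this sketch).
    """
    selected = []
    for ref_id_num, peaks in sorted(peaks_by_track.items()):
        tall = [p for p in peaks if p["height"] > cal_peak_intensity_threshold]
        if len(tall) == 1:  # skip tracks with no tall peak or with several
            selected.append(
                {"ref_id_num": ref_id_num, "apex": tall[0]["apex"], "height": tall[0]["height"]}
            )
    return selected


peaks_by_track = {
    99: [{"apex": 211, "height": 999999}],
    100: [{"apex": 50, "height": 500}],
    101: [{"apex": 80, "height": 200000}, {"apex": 140, "height": 300000}],
}
print(select_landmark_peaks(peaks_by_track))
```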

class asari.constructors.MassGrid(cmap=None, experiment=None)[source]

MassGrid is the concept for m/z correspondence in asari. This shares similarity with FeatureMap in OpenMS, but the correspondence in asari takes advantage of high m/z resolution first, before feature detection.

add_sample(sample, database_cursor=None)[source]

This adds a sample to MassGrid, including the m/z alignment of the sample against the existing reference m/z values in the MassGrid.

Parameters:
  • sample (SimpleSample instance) – instance of SimpleSample class.

  • database_cursor (cursor object) – Not used now.

Updates:
  • self._mz_landmarks_ – landmark m/z values that match to 13C/12C and Na/H patterns

  • self.MassGrid – DataFrame with reference sample as first entry

  • self.experiment.all_samples – adding this sample

bin_track_mzs(tl, reference_id)[source]

Bin all track m/z values into centroids via clustering, to be used to build massGrid.

Parameters:
  • tl (list[tuple]) – sorted list of all track m/z values in experiment, [(m/z, track_id, sample_id), …]

  • reference_id (str?) – the sample_id of reference sample. Not used now.

Returns:

[ (mean_mz, [(), (), …]), (mean_mz, [(), (), …]), … ]

Return type:

list of bins

Note

Because the range of each bin cannot be larger than mz_tolerance, and mass tracks in each sample cannot overlap within mz_tolerance, multiple entries from the same sample in the same bin will not happen. This is similar to the nearest neighbor (NN) clustering used in initial mass track construction.
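
The binning of a sorted m/z list can be illustrated with a simple range-based grouping. This is a stand-in for the clustering used by asari, not a reproduction of it; the function name, the ppm unit for the tolerance and the default of 5 are assumptions.

```python
def bin_sorted_mzs(tl, mz_tolerance_ppm=5):
    """Group a sorted list of (m/z, track_id, sample_id) tuples into bins.

    A new bin starts when an m/z value is further than the tolerance from
    the first m/z value of the current bin, so no bin is wider than the tolerance.
    Returns [(mean_mz, [entries...]), ...].
    """
    if not tl:
        return []
    bins, current = [], [tl[0]]
    for entry in tl[1:]:
        start_mz = current[0][0]
        if (entry[0] - start_mz) / start_mz * 1e6 <= mz_tolerance_ppm:
            current.append(entry)
        else:
            bins.append(current)
            current = [entry]
    bins.append(current)
    return [(sum(e[0] for e in b) / len(b), b) for b in bins]


tl = [
    (100.0000, 0, "A"),
    (100.0002, 0, "B"),
    (100.0100, 1, "A"),
    (100.0101, 1, "B"),
]
for mean_mz, members in bin_sorted_mzs(tl):
    print(round(mean_mz, 5), [m[2] for m in members])
```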

build_grid_by_centroiding()[source]

Assemble the mass grid by grouping m/z values into centroids. Each centroid can have no more than one mass track per sample. This is one of the two methods to build the grid, and is more efficient for a large number of samples.

build_grid_sample_wise()[source]

Align one sample at a time to the reference m/z grid, based on their anchor m/z tracks. This is one of the two methods to build the grid, and is better for reliable assembly of a small number of samples.

join(M2)[source]

Placeholder. Future option to join with another MassGrid via a common reference.

Parameters:

M2 (MassGrid instance) – the mass grid to be merged with this MassGrid