4.12. Workflow
ext_Experiment is the container for all project data. sample_registry is the dictionary that tracks sample information and status; it is kept outside the class to facilitate multiprocessing. The heavy lifting is done in constructors.CompositeMap,
which contains MassGrid for m/z correspondence and FeatureList from feature/peak detection.
Annotation is facilitated by jms-metabolite-services and mass2chem.
- asari.workflow.batch_EIC_from_samples_(sample_registry, parameters)[source]
Batch extraction of mass tracks from samples via multiprocessing.
- Parameters:
sample_registry (dict) – dictionary like {‘sample_id’: ii, ‘input_file’: file}, generated by register_samples.
parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.
- Returns:
shared_dict – dictionary object used to pass data between multiple processes.
- Return type:
dict
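The dispatch pattern described above can be sketched as follows. This is a minimal illustration, not asari's code: toy_worker and run_batch are hypothetical names, and multiprocessing.pool.ThreadPool is used as a stand-in because it offers the same starmap interface as a process pool while letting a plain dict be shared. With real worker processes, a multiprocessing.Manager().dict() would be needed instead.

```python
from multiprocessing.pool import ThreadPool


def toy_worker(sample_id, infile, shared_dict):
    # Hypothetical stand-in for per-sample extraction:
    # record a status and the input file under the sample's ID.
    shared_dict[sample_id] = ("passed", infile)


def run_batch(sample_registry, n_workers=2):
    # Plain dict works here because ThreadPool workers share memory;
    # a process pool would need multiprocessing.Manager().dict().
    shared_dict = {}
    iter_parameters = [
        (sample_id, record["input_file"], shared_dict)
        for sample_id, record in sample_registry.items()
    ]
    with ThreadPool(n_workers) as pool:
        # starmap unpacks each tuple into the worker's positional arguments.
        pool.starmap(toy_worker, iter_parameters)
    return shared_dict


# Assumed registry layout: integer ID -> {'sample_id': ..., 'input_file': ...}
registry = {
    0: {"sample_id": 0, "input_file": "a.mzML"},
    1: {"sample_id": 1, "input_file": "b.mzML"},
}
results = run_batch(registry)
```

Each worker writes only to its own sample_id key, so no two workers touch the same entry.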
- asari.workflow.create_export_folders(parameters, time_stamp)[source]
Creates a local directory for storing temporary files and output results. A time stamp is added to the directory name to avoid overwriting existing projects.
- Parameters:
parameters (dict) – passed from main.py to get outdir and project_name
time_stamp (str) – a time stamp string to prevent overwriting existing projects
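A minimal sketch of time-stamped folder creation, for illustration only. The function name make_project_dirs and the naming scheme (outdir, project_name and time stamp joined by underscores) are assumptions; the pickle and export subfolders follow the output layout shown in this section.

```python
import os
import tempfile


def make_project_dirs(parameters, time_stamp):
    # Assumed naming scheme: <outdir>_<project_name>_<time_stamp>
    project_dir = "_".join(
        [parameters["outdir"], parameters["project_name"], time_stamp]
    )
    for sub in ("pickle", "export"):
        # os.makedirs raises FileExistsError if the subfolder already exists,
        # so an existing project is never silently reused.
        os.makedirs(os.path.join(project_dir, sub))
    return project_dir


# Demonstration inside a throwaway directory.
base = tempfile.mkdtemp()
params = {
    "outdir": os.path.join(base, "rsvstudy"),
    "project_name": "asari_project",
}
project_dir = make_project_dirs(params, "427105156")
```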
- asari.workflow.get_mz_list(infile)[source]
Get a list of m/z values from infile, to be used as targets to extract from LC-MS features. Currently, extract is a shortcut function: asari still processes the full feature tables, then searches for the given targets.
- Parameters:
infile (str) – filepath to input table, which is tab or comma delimited and has m/z in the first column, header as first row.
- Return type:
A list of m/z values.
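Reading such a target table can be sketched as below. This is an illustrative reader, not the library's implementation; read_mz_targets is a hypothetical name and the file contents are made-up demonstration values.

```python
import os
import tempfile


def read_mz_targets(infile):
    mz_values = []
    with open(infile) as handle:
        next(handle, None)  # skip the header row
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # The table may be tab or comma delimited; m/z is the first column.
            delimiter = "\t" if "\t" in line else ","
            mz_values.append(float(line.split(delimiter)[0]))
    return mz_values


# Demonstration with a small tab-delimited file (illustrative values).
target_path = os.path.join(tempfile.mkdtemp(), "targets.tsv")
with open(target_path, "w") as out:
    out.write("mz\tname\n133.0142\ttarget_a\n180.0634\ttarget_b\n")
targets = read_mz_targets(target_path)
```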
- asari.workflow.make_iter_parameters(sample_registry, parameters, shared_dict)[source]
Generate iterables for multiprocessing starmap, used to get sample mass tracks.
- Parameters:
sample_registry (dict) – dictionary like {‘sample_id’: ii, ‘input_file’: file}, generated by register_samples.
parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.
shared_dict (dict) – dictionary object used to pass data between multiple processes.
- Returns:
A list of iterative parameters, e.g.
[(‘sample_id’, input_file, mode, mz_tolerance_ppm, min_intensity, min_timepoints,
min_peak_height, output_file, shared_dict), …]
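Building that list can be sketched as follows, mirroring the tuple order shown above. The function name, the parameter keys (mode, mz_tolerance_ppm, min_intensity, min_timepoints, min_peak_height, tmp_pickle_dir), the registry layout and the output file naming are all assumptions made for illustration.

```python
import os


def build_iter_parameters(sample_registry, parameters, shared_dict):
    iters = []
    for sample_id, record in sample_registry.items():
        infile = record["input_file"]
        # Assumed naming: one pickle per sample, named after the input file.
        stem = os.path.splitext(os.path.basename(infile))[0]
        outfile = os.path.join(parameters["tmp_pickle_dir"], stem + ".pickle")
        iters.append((
            sample_id,
            infile,
            parameters["mode"],
            parameters["mz_tolerance_ppm"],
            parameters["min_intensity"],
            parameters["min_timepoints"],
            parameters["min_peak_height"],
            outfile,
            shared_dict,  # the same object is handed to every worker
        ))
    return iters


# Illustrative parameter values, not asari defaults.
demo_parameters = {
    "mode": "pos",
    "mz_tolerance_ppm": 5,
    "min_intensity": 1000,
    "min_timepoints": 6,
    "min_peak_height": 10000,
    "tmp_pickle_dir": "pickle",
}
demo_registry = {0: {"sample_id": 0, "input_file": "data/Blank_20210803_003.mzML"}}
demo_shared = {}
iter_parameters = build_iter_parameters(demo_registry, demo_parameters, demo_shared)
```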
- asari.workflow.process_project(list_input_files, parameters)[source]
This defines the main workflow for processing a list of LC-MS files: it creates the output folder with a time stamp and uses the sample registry to coordinate parallel processing. All project data are tracked in the experiment.ext_Experiment class.
- Parameters:
list_input_files (list[str]) – list of centroided mzML filepaths from LC-MS metabolomics. Usually found in a folder.
parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.
Outputs
A local folder with asari processing results, e.g.:
rsvstudy_asari_project_427105156
├── Annotated_empricalCompounds.json
├── Feature_annotation.tsv
├── export
│   ├── _mass_grid_mapping.csv
│   ├── cmap.pickle
│   ├── full_Feature_table.tsv
│   └── unique_compound__Feature_table.tsv
├── pickle
│   ├── Blank_20210803_003.pickle
│   ├── ...
├── preferred_Feature_table.tsv
└── project.json
The pickle folder is removed after processing by default.
- asari.workflow.process_xics(list_input_files, parameters)[source]
Get XICs (aka EICs or mass tracks) from a folder of centroided mzML files and store them in local pickle files.
- Parameters:
list_input_files (list[str]) – list of centroided mzML filepaths from LC-MS metabolomics. Usually found in a folder.
parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.
Outputs
A local folder with asari extracted XICs (aka EICs or mass tracks), without full processing, in pickle files.
- asari.workflow.read_project_dir(directory, file_pattern='.mzML')[source]
Reads centroided LC-MS files from a directory and returns a list of files that match file_pattern.
- Parameters:
directory (str) – path to a directory containing mzML files
file_pattern (str, optional, default: '.mzML') – files with this substring will be ingested
- Return type:
list of paths to files containing the file_pattern
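Substring-based file selection can be sketched as below. list_matching_files is a hypothetical name, and sorting the result is an assumption added to make the output deterministic.

```python
import os
import tempfile


def list_matching_files(directory, file_pattern=".mzML"):
    # Keep every file whose name contains file_pattern as a substring.
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if file_pattern in name
    )


# Demonstration with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("a.mzML", "b.mzML", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()
found = list_matching_files(demo_dir)
```

Because the match is a substring test rather than an extension test, a file named a.mzML.bak would also be returned.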
- asari.workflow.register_samples(list_input_files)[source]
Establishes a sample_id for each input file and returns sample_registry as a dictionary.
- Parameters:
list_input_files (list[str]) – list of input filepaths, each representing a sample
- Return type:
sample_registry, a dictionary of integer sample IDs to filepaths
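A minimal sketch of such a registry, assuming each integer ID maps to a record like {'sample_id': ii, 'input_file': file} as described for the other functions in this section. build_sample_registry is a hypothetical name, and sorting the inputs is an assumption made so that IDs are reproducible.

```python
def build_sample_registry(list_input_files):
    # Assign consecutive integer IDs, starting from 0, in sorted file order.
    return {
        ii: {"sample_id": ii, "input_file": infile}
        for ii, infile in enumerate(sorted(list_input_files))
    }


demo_registry = build_sample_registry(["b.mzML", "a.mzML"])
```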
- asari.workflow.remove_intermediate_pickles(parameters)[source]
Remove all temporary files under pickle/ to free up disk space.
- Parameters:
parameters (dict) – passed from main.py to get tmp_pickle_dir
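The cleanup step can be sketched as follows. clear_pickle_dir is a hypothetical name; only the tmp_pickle_dir key comes from the description above, and leaving the folder itself in place is an assumption of this sketch.

```python
import os
import tempfile


def clear_pickle_dir(parameters):
    pickle_dir = parameters["tmp_pickle_dir"]
    removed = 0
    for name in os.listdir(pickle_dir):
        path = os.path.join(pickle_dir, name)
        if os.path.isfile(path):  # leave any subdirectories untouched
            os.remove(path)
            removed += 1
    return removed


# Demonstration with two empty placeholder files.
demo_pickle_dir = tempfile.mkdtemp()
for name in ("s1.pickle", "s2.pickle"):
    open(os.path.join(demo_pickle_dir, name), "w").close()
n_removed = clear_pickle_dir({"tmp_pickle_dir": demo_pickle_dir})
```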
- asari.workflow.single_sample_EICs_(sample_id, infile, ion_mode, database_mode, mz_tolerance_ppm, min_intensity, min_timepoints, min_peak_height, outfile, shared_dict)[source]
Extraction of mass tracks from a single sample. Used by multiprocessing in batch_EIC_from_samples_. shared_dict is critical because it is how results are passed back. Each entry is keyed by sample_id and holds:
(‘status:mzml_parsing’, ‘status:eic’, outfile, max_scan_number, list_scan_numbers, list_retention_time, track_mzs, number_anchor_mz_pairs, anchor_mz_pairs, dict({mass tracks}) )
track_mzs or anchor_mz_pairs are used later for aligning m/z tracks.
The list of scans starts from 0.
anchor_mz_pairs are defined as m/z pairs that match the 13C/12C pattern. More anchors mean better coverage of features, which helps in selecting the reference sample.
- Parameters:
sample_id (int) – sample id passed from sample_registry by make_iter_parameters.
infile (str) – input mzML filepath, passed from sample_registry by make_iter_parameters.
ion_mode (str) – from parameter dictionary, marks if acquisition was in positive or negative mode
database_mode (str) – from parameter dictionary, marks if intermediates are kept on disk or in memory
mz_tolerance_ppm (float) – from parameter dictionary, the assumed m/z resolution of the instrument
min_intensity (float) – peaks below this value are ignored
min_timepoints (int) – the number of time points a peak must span to be considered a peak
min_peak_height (float) – peaks below this height are ignored
outfile (str) – where the output will be written, passed from parameter dictionary by make_iter_parameters.
shared_dict (dict) – dictionary object used to pass data between multiple processes.
- Updates:
shared_dict (dict) – dictionary object used to pass data between multiple processes.
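The per-sample tuple described above can be given field names to make downstream access less error-prone. The sketch below is illustrative: SampleRecord and pick_reference are hypothetical names, the entries are toy data, and choosing the reference by the highest anchor count is an assumption based on the note that more anchors help in selecting the reference sample.

```python
from collections import namedtuple

# Field order follows the tuple layout described for shared_dict entries.
SampleRecord = namedtuple("SampleRecord", [
    "status_mzml_parsing", "status_eic", "outfile", "max_scan_number",
    "list_scan_numbers", "list_retention_time", "track_mzs",
    "number_anchor_mz_pairs", "anchor_mz_pairs", "mass_tracks",
])


def pick_reference(shared_dict):
    # Assumed rule: the sample with the most 13C/12C anchor pairs wins.
    records = {sid: SampleRecord(*entry) for sid, entry in shared_dict.items()}
    return max(records, key=lambda sid: records[sid].number_anchor_mz_pairs)


# Toy entries; status strings and numbers are made up for illustration.
toy_shared_dict = {
    0: ("passed", "passed", "pickle/a.pickle", 2, [0, 1, 2], [0.5, 1.0, 1.5],
        [100.0, 101.0034], 1, [(0, 1)], {}),
    1: ("passed", "passed", "pickle/b.pickle", 2, [0, 1, 2], [0.5, 1.0, 1.5],
        [100.0, 101.0034, 200.0, 201.0034, 300.0, 301.0034], 3,
        [(0, 1), (2, 3), (4, 5)], {}),
}
reference_id = pick_reference(toy_shared_dict)
```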