ATLAS in silico: Data & Mappings  
Data & Mappings
ATLAS in silico is created from the Global Ocean Sampling Expedition (GOS) oceanic microorganism metagenomics dataset, GOS metadata, and contextual socio-economic and environmental data. This section describes the datasets along with the visual and auditory encodings that map the data to procedural graphics and sound within the artwork.
  1. Global Ocean Sampling Expedition (GOS) Data
  2. Multiple Scales
  3. GOS Metadata and Contextual Socio-economic and Environmental Data
  4. Mappings
  5. MDE: Scalable Metadata Environments
  6. SGO: Meta-shape grammar objects
  7. SADS: Scalable auditory data signatures
  8. BLAST analysis within ATLAS in silico
  9. References
Global Ocean Sampling Expedition (GOS) Data
The Global Ocean Sampling Expedition (GOS) (2003 - 2010), conducted by the J. Craig Venter Institute, studies the genetics of communities of marine microorganisms throughout the world's oceans. Oceanic microorganisms sequester carbon from the atmosphere, with significant impacts on global climate. The mechanisms of this biogeochemical process are not yet fully understood. Findings from the first GOS release (Parthasarathy, Hill and MacCallum, 2007) were published in PLOS in a dedicated Ocean Metagenomics Collection that includes a section on CAMERA, the community resource for metagenomics located at UCSD Calit2. Scientific collaborators on the ATLAS in silico project include the executive director of CAMERA and bioinformatics researchers who developed algorithms for analysis of GOS data. They provided the GOS dataset to the author.
Uncultured organisms: The GOS team collected samples of ~200 liters of sea water containing communities of microorganisms (approximately 1 million microorganisms per milliliter of sea water) within their oceanic habitats at hundreds of locations in the world's oceans (J. Craig Venter Institute, 2004, 2010a, 2010b, no date). DNA from each ~200 L sample is extracted and sequenced by whole-genome environmental shotgun sequencing, which overcomes the inability to culture these organisms in the laboratory. Because the organisms are not cultured and the pooled sample (the community of microorganisms at a site) is sequenced directly, selection bias is avoided. Direct sampling creates a snapshot of microorganismal biodiversity at the sampling time and location and under the recorded conditions. The recorded conditions for each sample establish a comprehensive set of metadata to contextualize the GOS data.
Sequences: Nucleotide sequences are computationally reconstructed and analyzed from the pooled sequence reads for each sample. Raw DNA reads of approximately 1,000 base pairs are computationally assembled into longer nucleotide sequences known as "assemblies." These are then analyzed to compute open reading frames (ORFs), which are potential amino acid sequences read in all possible "reading frames" along the sense and antisense strands of the DNA molecule. Bioinformatics analyses conducted by GOS/CAMERA researchers compared the novel GOS sequences with all known, non-redundant sequences in existing genomic repositories, such as NCBI (National Center for Biotechnology Information), TGI-EST (TIGR Gene Indices) (Quackenbush et al., 2001), and ENS (Ensembl) (Birney et al., 2006), in order to assign the GOS sequences to kingdoms (prokaryotic, viral, eukaryotic, archaea) (Yooseph, Li and Sutton, 2008). These and additional bioinformatics analyses yield identification of new genes and genomes for organisms that have not been cultivated in the laboratory, as well as new organisms and lineages. Ongoing analysis of the GOS continues to yield insight into the conditions that drive microbial biodiversity.
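The idea of reading a nucleotide sequence in all six reading frames (three offsets on each strand) can be illustrated with a minimal Python sketch. This is not the GOS/CAMERA pipeline. It assumes the standard genetic code, treats an ORF as any stop-free stretch between stop codons (no start codon required), and uses hypothetical function names and a toy input sequence.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (an assumption;
# the source does not say which translation table the GOS analyses used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=2):
    """Return (strand, frame, peptide) for stop-free stretches in all six frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            # '*' marks a stop codon, so splitting on it yields candidate ORFs.
            for peptide in translate(seq[frame:]).split("*"):
                if len(peptide) >= min_len:
                    orfs.append((strand, frame, peptide))
    return orfs


if __name__ == "__main__":
    for orf in six_frame_orfs("ATGGCCTAAGGG"):
        print(orf)
```

Real ORF callers apply much longer minimum lengths and handle ambiguous bases and partial ORFs at read ends; the sketch only shows the frame enumeration itself.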
The GOS dataset is published (deposited into publicly accessible repositories) in multiple releases. The first release, the entirety of which is used to create ATLAS in silico, contains 17.4 million open reading frames (ORFs): DNA sequences for predicted amino acid sequences (proteins). The ORF amino acid sequence lengths range from ~60 to ~7,000 amino acids. The scale and significance of the GOS is evidenced by the fact that publication of this first release doubled the number of protein sequences in publicly accessible genomic repositories, relative to the total that the scientific community had accrued over the prior 30 years (Yooseph et al., 2007; marine metagenome (ID 13694) - BioProject - NCBI, no date). The figure below shows an example of a record from the GOS dataset used in ATLAS in silico.
Figure: Structure of a GOS dataset record from the first release. Each database record contains three broad categories of data, each with values at multiple physical, spatial, temporal, biological or informational scales: 1) Identifiers and metadata (top), which include ORF and sequence read ID numbers, as well as metadata such as location of the sample, nearest country, ocean region, and habitat type, or measurements such as chlorophyll, salinity, temperature and pH at the sample lat/lon and depth; 2) IP Notice (middle), which reflects the MOUs (memoranda of understanding) established between the JCVI expedition and the countries (governments) nearest the sample sites; these are in addition to research permits and sample export permits issued by countries, and specify the possible uses of genetic materials; and 3) Sequence data (bottom), which includes the ORF (amino acid) sequence and the corresponding DNA sequence.
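The three-part record structure described in the caption can be sketched as a small data structure. The class names, field names, and example values below are hypothetical and chosen only for illustration; they are not the actual GOS or CAMERA schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleMetadata:
    """Metadata recorded at the sample site (field names are assumptions)."""
    sample_location: str
    country: str
    habitat_type: str
    latitude: float
    longitude: float
    sample_depth_m: float
    chlorophyll: Optional[float] = None
    salinity: Optional[float] = None
    temperature_c: Optional[float] = None
    ph: Optional[float] = None


@dataclass
class GosRecord:
    """One record: identifiers and metadata, IP notice, sequence data."""
    orf_id: str
    read_id: str
    metadata: SampleMetadata
    ip_notice: str
    orf_sequence: str   # predicted amino acid sequence
    dna_sequence: str   # corresponding nucleotide sequence

    def orf_length(self) -> int:
        return len(self.orf_sequence)


if __name__ == "__main__":
    # All values below are made up for the example.
    record = GosRecord(
        orf_id="ORF_0001",
        read_id="READ_0001",
        metadata=SampleMetadata(
            sample_location="example site",
            country="example country",
            habitat_type="open ocean",
            latitude=0.0,
            longitude=0.0,
            sample_depth_m=2.0,
        ),
        ip_notice="example notice text",
        orf_sequence="MAKL",
        dna_sequence="ATGGCCAAACTG",
    )
    print(record.orf_id, record.orf_length())
```

Measurements are optional in the sketch because a given sample may not have every value recorded.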
Multiple Scales
Qualitative scale: GOS data, the metadata recorded for each sample site, and additional contextual data (either computed from GOS data and metadata values or sourced online) contain values at a multiplicity of scales. These range from invisible "informational" scales through molecular, biological, physical, spatial, temporal, ecological, socio-economic and environmental scales. The table below summarizes the categories of data used to create the artwork according to a set of qualitative scale descriptors.
Qualitative Scale GOS Data GOS Metadata Contextual Data
Informational / Molecular
Nucleic acid (DNA) sequence reads
ORF (amino acid) sequences predicted from open reading frames of the DNA sequence
Biochemical / Biological
Identification and classification e.g. gene identification, protein classification and function etc. via bioinformatics analyses
Biophysiochemical properties of predicted ORF (aa) sequences, e.g. secondary or tertiary structure, polarity, isoelectric point, hydrophobicity, side-chain size, etc., computed using Biopython and other tools.
Sample collection: date, start time, stop time
Latitude, longitude
Sample location
Geographic location
Sample depth
Water depth
Habitat Type
Chlorophyll density
IP Notice
MOU, permits
Infant mortality per thousand for MOU country nearest sample site
Internet users per capita for MOU country nearest sample site
CO2 emissions per capita (metric tons) for MOU country nearest sample site
Table: Each database record contains GOS data and associated metadata annotations. This data is further annotated with socio-economic and environmental data for regions nearest the sampling site, as per the sampling permits and MOUs, and with features computed directly from GOS data or metadata values.
GOS Metadata and Contextual Socio-economic and Environmental Data
GOS Metadata: GOS samples were collected by the J. Craig Venter Institute aboard the custom-built sailing yacht Sorcerer II, its route inspired by the HMS Beagle and HMS Challenger circumnavigations. Starting in Halifax, Canada, sites along the US East Coast, the Gulf of Mexico, the Galapagos Islands, the central and south Pacific, Australia, the Indian Ocean, South Africa and throughout the Atlantic were sampled (J. Craig Venter Institute, 2010a). Samples were collected at multiple depths and in multiple habitat types (e.g. open ocean, coastal, estuary, etc.). As shown in the record figure above, metadata values recorded at each sample location for the GOS data used in the project include: 1) habitat type, 2) geographic location, 3) sample location, 4) country, 5) latitude, longitude, 6) sample depth, 7) water depth, 8) chlorophyll density, 9) salinity, 10) temperature, 11) pH, 12) start/stop time of collection. Additionally, the time and date of each sample are correlated with satellite imagery. GOS metadata contains values at multiple scales in addition to the scales present in the primary data within each record.
Contextual Data: Selected socio-economic and environmental data corresponding to the countries or locations where samples were collected are compiled from online resources to further contextualize the GOS data and metadata. Additionally, biophysiochemical features of GOS ORF (amino acid) sequences are computed using Biopython and other tools.
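As an illustration of computing sequence-level features from an ORF (amino acid) sequence, the following sketch calculates mean hydrophobicity and the fraction of charged residues using only the Python standard library. The choice of the Kyte-Doolittle hydropathy scale, the set of residues counted as charged, and the function name are assumptions made for this example; the source states only that Biopython and other tools were used, not which features or scales.

```python
# Kyte-Doolittle hydropathy values per amino acid (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residues treated as charged at neutral pH in this sketch.
CHARGED = set("DEKR")


def sequence_features(seq):
    """Return simple per-sequence features for an amino acid string."""
    seq = seq.upper()
    # Residues outside the 20 standard amino acids are skipped for hydropathy.
    scored = [KYTE_DOOLITTLE[aa] for aa in seq if aa in KYTE_DOOLITTLE]
    gravy = sum(scored) / len(scored) if scored else 0.0
    charged = sum(1 for aa in seq if aa in CHARGED)
    return {
        "length": len(seq),
        "gravy": gravy,  # mean hydropathy over scored residues
        "charged_fraction": charged / len(seq) if seq else 0.0,
    }


if __name__ == "__main__":
    print(sequence_features("MAKLDE"))
```

Features like these give each record additional numeric dimensions that can be mapped to visual or sonic parameters alongside the recorded metadata.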
Mappings
The GOS data, GOS metadata and contextual data are visually and sonically encoded in a series of interrelated poetic mappings to construct ATLAS in silico's interactive and generative virtual environment. These mappings comprise 1) the scalable metadata environment (MDE), 2) meta-shape grammar objects (SGO), and 3) scalable auditory data signatures (SADS). Each mapping is described below along with relevant publications and links to PDFs in the Texts section.
Scalable Metadata Environment (MDE)
The overarching mapping is the construction of the ATLAS in silico virtual environment from GOS metadata and contextual environmental data within a scalable metadata environment (MDE). An MDE is a novel artistic/poetic approach, developed by the author and published in West et al. (2014) (see PDF in the Texts section), to the design and construction of scalable 4D generative and interactive virtual environments from large, multidimensional datasets.
The MDE virtual world is constructed from GOS metadata and contextual data as an unbounded universe. Within it, all of the graphics and audio that participants experience, including those that result from their interaction, are procedurally generated at run time from GOS data, computed features of GOS sequences, GOS metadata and contextual data. These are mapped onto scalable graphical (SGO) (West et al., 2009a, 2009b) (see PDF) and auditory (SADS) (Gossmann et al., 2008) (see PDF) representations. SGOs and SADS are generated from overlapping sets of data, and SADS are auditory analogs to the SGOs.
Figure: ATLAS in silico MDE prior to GOS data being placed into the environment. The environment is structured as 8 three-dimensional metadata regions in a 2 x 4 arrangement. The regions are (top row, left to right) Habitat type, Temperature, Salinity, and Chlorophyll concentration, and (bottom row, left to right) Amino acid length, Sample location, Sample depth, and CO2 per capita. Each region is constructed from GOS metadata and lat/lon coordinates for sample sites from the first release of GOS data. Within the MDE a fluid dynamics simulation is running but not yet visible, as there are no data particles (GOS database records) in the virtual world. GOS data is then placed into the MDE at random locations, with each database record represented as a colored particle (see video at ~0:16). With the fluid dynamics simulation active, each particle moves according to the values of the GOS data in the database record it represents and its metadata values for each respective metadata region, and dynamic patterns emerge within the data fluid. Video (49 seconds; the first 16 seconds are silent, then audio starts when data loads and moves within the fluid).
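One way to picture how a record's metadata values could steer its particle among the eight regions is the simplified sketch below. It is not the artwork's fluid dynamics simulation: it assumes a flat 2D layout with unit-sized cells, assumes each metadata value has been normalized to a weight between 0 and 1, and replaces the fluid with a plain weighted pull toward region centers. All names are hypothetical.

```python
# Region layout from the caption: 2 rows x 4 columns (identifiers are assumptions).
REGION_NAMES = [
    ["habitat_type", "temperature", "salinity", "chlorophyll"],
    ["amino_acid_length", "sample_location", "sample_depth", "co2_per_capita"],
]


def region_centers():
    """Return the center (x, y) of each unit-sized region; top row has y = 1.5."""
    centers = {}
    for row, names in enumerate(REGION_NAMES):
        for col, name in enumerate(names):
            centers[name] = (col + 0.5, 1.5 - row)
    return centers


def attraction_target(weights):
    """Weighted centroid of region centers for one particle.

    `weights` maps region name -> normalized metadata value in [0, 1].
    """
    centers = region_centers()
    total = sum(weights.values())
    if total == 0:
        return (2.0, 1.0)  # no pull: rest at the middle of the layout
    x = sum(w * centers[name][0] for name, w in weights.items()) / total
    y = sum(w * centers[name][1] for name, w in weights.items()) / total
    return (x, y)


def step(position, target, rate=0.1):
    """Move a particle a fraction of the way toward its target."""
    return tuple(p + rate * (t - p) for p, t in zip(position, target))


if __name__ == "__main__":
    pos = (0.0, 0.0)
    target = attraction_target({"temperature": 0.9, "sample_depth": 0.3})
    for _ in range(5):
        pos = step(pos, target)
    print(target, pos)
```

In this toy version, particles with similar metadata values converge on similar positions, which hints at why patterns emerge when many records are in motion at once.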
Figure: Rotation flythrough of the ATLAS in silico metadata environment. GOS data is moving between metadata regions in a "data fluid." (Video, 20 seconds)
Figure: Video (1:21 duration). The MDE is reconfigurable at runtime: regions can be moved to alternate locations, and new patterns emerge. Navigation and interaction work the same as in the rectangular region layout. This video was generated by screen capture via a Linux capture card on the graphics machine.
Figure: Video (2:23 duration). Video capture of a participant interacting with the reconfigured ATLAS in silico MDE. The video shows a rear-projected stereoscopic 3D system using passive 3D glasses and Flock of Birds tracking.
Figure: GOS dataset (17.4M records) within the MDE with detail call out.
Figure: GOS dataset with a filter operation applied to select only those GOS records that assemble at multiple GOS sites within the first release dataset, with detail callout. This subset of records exists within the full GOS dataset, yet is not easily visible without the filter operation.
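A filter of this kind can be sketched in a few lines of Python. The sketch assumes each record carries a list of the sample sites contributing to its assembly under a hypothetical `site_ids` key; the actual record fields and filter logic used in ATLAS in silico are not specified in the source.

```python
def assembled_at_multiple_sites(records):
    """Keep records whose assembly draws on more than one distinct site.

    Each record is a dict with a hypothetical 'site_ids' list.
    """
    return [r for r in records if len(set(r["site_ids"])) > 1]


if __name__ == "__main__":
    # Identifiers below are made up for the example.
    records = [
        {"orf_id": "ORF_A", "site_ids": ["SITE_1"]},
        {"orf_id": "ORF_B", "site_ids": ["SITE_1", "SITE_2"]},
        {"orf_id": "ORF_C", "site_ids": ["SITE_3", "SITE_3"]},
    ]
    for r in assembled_at_multiple_sites(records):
        print(r["orf_id"])
```

Duplicate site entries are collapsed before counting, so a record listed twice at the same site is not selected.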
Meta-shape Grammar Objects (SGO)
Figure: This figure shows the process whereby a subset of GOS records is selected from the data/particle fluid, enlarged to show the records' detail as shape grammar objects, and arranged to fill the entire virtual environment in a grid-like layout, similar to the atlas plates of physical specimens rendered by naturalist illustrators. The shape grammar object in the middle section, highlighted by the circle, is the object selected by the participant after performing a comparative analysis through embodied gestures. The selected object is further enlarged and moves forward in front of the display surface onto the body of the participant in the tracked volume. The data contained in the object and its sonic signature (scalable auditory data signature) are read off by a text-to-speech engine while the participant interacts with the detailed object.
Figure: An enlarged view of the selected objects showing detail. The white filament-like structures in the image are the data/particle fluid. It is paused during this phase and provides context for the interaction. The entire GOS dataset becomes context for the comparative analysis and the subsequent "detail in context" exploration of the individual object (GOS database record).
Figure: SGO symmetry examples - 2, 4, 3, 5 and 6 symmetry levels for the shape grammar objects. Symmetry is determined by calculating features along the amino acid sequence.
Scalable Auditory Data Signatures (SADS)
Figure: Scalable auditory data signature (SADS) from ATLAS in silico. Scalable auditory data signatures layer dimensions of sound that map to multiple dimensions of data from the Global Ocean Sampling expedition. SADS are auditory representations that accommodate quantitative and qualitative listening by encapsulating different depths of data through a combination of temporal and spatial scaling. See Gossmann et al. (2008) in the Texts section.
BLAST analysis within ATLAS in silico
Figure: BLAST analysis results for GOS data within CAMERA portal are returned with associated metadata values in a tabular format.
Figure: BLAST analysis results from CAMERA portal returned with associated metadata within ATLAS in silico.
Video: (35 seconds) BLAST analysis results returned within ATLAS in silico MDE environment.
Figure: Video (26 seconds) Participant exploring BLAST analysis results returned within ATLAS in silico.
References