MIRA 4.0
--------

MIRA 4.0 is the result of a bit more than two years of work since MIRA 3.4
came out and much has changed. A lot behind the scenes, but also interesting
things for everyone, most notably in speed and quality terms.

The most important change for users: the interface. Please absolutely do
consult the section on manifest files in the documentation! Users who have not
used the MIRA 3.9 development series ABSOLUTELY MUST read the
documentation. Read at least, in the MIRA Definitive Guide, the sections
describing the new assembly description files ("manifest files"); and have a
look at the chapters describing the results of MIRA; and the utilities in the
package:

     3.5. Configuring an assembly: files and parameters
          3.5.1. The manifest file: introduction
          3.5.2. The manifest file: basics
          3.5.3. The manifest file: defining the data to load
       9. Working with the results of MIRA
      10. Utilities in the MIRA package

Furthermore, you might want to skim through the following chapters:

       4. Preparing data
       5. De-novo assemblies
       6. Mapping assemblies
       7. EST / RNASeq assemblies
       8. Parameters for special situations


Main improvements made to simplify life:
- flexibilised parametrisation to easily define input data and assembly job:
  the "manifest" configuration files allow using concepts of read groups as
  well as segment orientation & segment placement
- new "fire & forget" mode of MIRA which basically should reduce misassemblies
  in result files to zero: earlier version of MIRA would dump out
  misassemblied contigs (with markers pointing at the misassembly), now
  contigs dumped out do not contain any misassembly (at least none which MIRA
  could discover).
- possibility to have MIRA determine automatically paired library parameters
  (size & orientation)
- new automatic extraction of "large contigs" at the end of a genome de-novo
  assembly.
- SAM output via "miraconvert", simplifies interaction with outside world
  (gap5, tablet etc.pp)
- full GFF3 input and output compatibility, using Sequence Ontology,
  translation to and from gap4/gap5
- CASAVA 1.8 read naming for the new Illumina read name scheme
- new sequencing "technology" TEXT for unspecified data, i.e., from databases
  like NCBI etc.
- new companion program "mirabait", which is a "grep" like utility for
  retrieving sequences by in-silico baiting
- automatic estimation of parameters for paired libraries
- lots of other improvements left and right which add up ;-)

Main speed improvements:
- faster contig handling routines, improves de-novo assembly times with Ion
  Torrent or 454 / Solexa hybrid by 30%
- faster mapping routines, allows MIRA to more or less gracefully handle
  projects with several thousand reference sequences. Useful for mapping
  against EST / RNASeq assemblies.
- faster handling of deep coverage RNASeq and genome data (to be improved
  still)
- faster kmer counting & smaller footprint in RAM and on disk
- faster checking of template restrictions
- optimized: faster data reading, does not need to count reads beforehand
  anymore
- lots of other improvements left and right which add up ;-)

Main assembly quality improvements:
- improved assembly quality for ultra-high coverage Solexa RNASeq data contigs
  (new parameter -CL:rkm)
- better handling of sequencing libraries with uneven coverage (Nextera)
- lots of other improvements left and right which add up, like, handling of
  newer MiSeq and Ion data, lossless digital normalisation etc.pp ;-)

Documentation
- rewritten in large parts for manifest files
- started to update walkthrough for newer public data sets.

Other internal changes:
- lots of reshuffled/reworked code
- improved gcc warnings
- improved build environment, now out-of-the box building on more Linux
  distros and Mac OSX
- better and simplified code due to transitioning to C++11
- new MAF format v2



Changes for the 3.9.x development line since MIRA 3.4.0:
========================================================
4.0
---
- improvement: missing or additional backslashes in "parameters=" manifest
  lines are diagnosed a bit better.
- increased max length of reads allowed in mapping to 32kb
- fine-tuning of standard parameters for PCBIOHQ reads
- fine-tuning of contig building in low-coverage areas of a contig
- fine-tuning of standard parameters for 3' polybase clipping
- bugfix: SAM output for mapping assemblies was sometimes broken (bug
  uncovered by new parameter -SB:tor=yes)
- bugfix: the term "exclusion_criterion" for "segment_placement" in manifest
  files was parsed wrongly.
- bugfix: miraconvert now does not munch away part of filenames separated by
  dots.
- bugfix: using digital normalisation on genome assemblies led to an error,
  fixed.


4.0rc5
------
- major improvements in contig building for genomes with really nasty repeats,
  leading to more accurate contigs especially with paired data
- interface change: the "--highlyrepetitive" keyword has been replace by
  "--hirep_something", "--hirep_good" and "--hirep_best"
- improvement: the proposal for "large contigs" in de-novo assemblies extracts
  a better subset of contigs to represent the "true" genome, especially in
  hybrid assemblies.
- improvement: automatic estimation of parameters for paired libraries,
  switched on via "autopairing" keyword in manifest files
- improvement: in mapping assemblies better alignment in cases where the
  mapped data contains non-clonal sequencing data or data which has more indels.
- improvement: the automatic sequencing error editor for "trivial" errors now
  corrects more cases
- improvement: new command line parameter -t for setting number of threads
  (overwriting -GE:not if set in manifest)
- improvement: new command line parameter -m and -M for checking manifest
  files without performing an assembly
- improvement: poly-A stretches in EST / RNASeq assemblies are kept by default
  in the sequences to differentiate between different 3' UTR endings
- improvement: the routines searching for chimeras were made more resistant to
  finding wrong chimeras in "low" coverage data
- improvement: skim now filtering hits saved on disk a bit faster for de-novo
  assemblies
- improvement: the misassembly detector for libraries with paired reads is on
  by default (temporary parameter: -MI:ef3)
- improvement: detection (and warning) if de-novo assembly shows signs of a
  skewed coverage over a genome like it occurs, e.g., in sequencing data
  sampled from exponential growth phase of bacteria.
- improvement: started to log warnings to separate info files
- bugfix: extraction of large contigs only done for genome de-novo assemblies
- bugfix: poly-A stretches were annotated as polyA_signal_sequence instead of
  polyA_sequence in GFF3 files.
- bugfix: non-repetitive overlaps in reads with masked stretches (-SK:mnr=yes)
  are less likely to be dismissed, leading to better assembly of heavy
  repeats.
- new parameters -ED:mace:eks:ehpo for more fine grained control of automatic
  editor
- new parameters -NW:cac:acv to catch and warn early about assemblies
  performed with overly large coverage
- new parameter -SB:bnb to allow mapping without bootstrapping
- minor: mirabait can no re-use precalculated kmer statistics via -L parameter
- minor: command line -v for mirabait & miramem
- minor; when compiling, the build script for MIRA now honours $CC and $CXX if
  set
- minor: updated GTAGDB file for compatibility to Staden gap4 / gap5 package
  (support directory)


4.0rc4
------
- improvement: simple variations in coverage which are not repeats are now
  handled gratiously, i.e., these do not break contig building. Seems to be
  extremely important for some Nextera libraries which can have some terribly
  uneven coverage.
- improvement: the automatic adaptor clipping was a tad too overzealous
  for some genome data and clip perfectly valid data in very rare cases.
- major bugfix: segment placements were not checked rigorously enough. E.g.,
  paired-end Illumina would also accept a mate-pair placement of its reads.
- major bugfix: paired-reads libraries were essentially broken since 3.9.11
  and reads were often assigned a wrong partner.
- improvement: the automatic overcall editor is now configurable and defaults
  to correct only 454 and Ion data.
- bugfix: some coverage values in the assembly info file contained temporary
  values from during the assembly, not from the end result.
- bugfix: when MIRA was started from a location not in path, it would not
  always automatically find "miraconvert" and fail at the extraction of large
  contigs in de-novo genome assemblies.


4.0rc3
------
- binaries now recognise themselves when run as "mira4", "mira4bait" and
  "mira4convert". Note: the canonical name should be those without "4" in the
  name though.
- small bugfix: error code was not >0 in certain situations
- mira now searches the binaries for itself and miraconvert in the directory
  the called binary was installed, then in the $PATH environment.
- change: gap4da result type now always turned off by default
- new parameters: -HK:mhpb:rkrk
- loading of files in the manifest: a file type can now be explicitely given
  via the EMBOSS-like double colon scheme. E.g.: data = fastq::file.dat
- bugfix: miraconvert -A did not work
- building: m4 configuration files now in the "m4" directory, not "config/m4"
- bugfix: the shell script to show how to extract large contigs now always
  written for genome assembly projects even if automatic extraction failed.
- data checks: MIRA now halts if it does not correctly recognise read pairs
  due to unknown read naming schemes.


4.0rc2
------
- speed improvement: extracting the consensus of an assembly via miraconvert
  is now faster by a factor between 1.3 and 4.5 (depending on number of
  strains present and a couple of other things)
- improvement: determination of "large contigs" did not handle well projects
  with corrected PacBio reads, fixed.
- convenience: when using -DI:trt, MIRA now creates temporary dir names for
  the target directory so that several runs of MIRA can be started in parallel
  using the same manifest file but the temporary data of these runs do not
  interfere with eachother
- bugfix: Smith-Waterman alignments entered an infinite loop in rare cases
- bugfix: "miraconvert -r C" produced IUPACs
- bugfix: miraconvert did not evalute the complete path of the result file but
  always wrote to current directory
- bugfix: target install missed a && in progs/Makefile.am
- bugfix: loading of assemblies (in mira and miraconvert) could fail
- bugfix: saved diginorm multiplier values in MAF files led sometimes to
  MAF parsing errors.
- bugfix: contig coverage statistics did not always correctly account for
  diginorm reads
- fix for OSX: better binaries which should behave like Unix binaries
- small fix for compiling on Cygwin


4.0rc1
------
- for de-novo assemblies, MIRA now automatically extracts 'large contigs' into
  additional result files
- new info file: "largecontigs". Only written for genome denovo.
- saving of "AllStrains" FASTA or FASTAQ now happens only if more than one
  strain is present in the assembly (mira and convert_project)
- new behaviour in mapping assemblies: reads overhanging the reference
  sequence left and right are trimmed back by default to the boundaries of the
  reference sequence. Can be controlled via new parameter -SB:tor
- added new file type in manifest: .exp for loading single EXP files (compare
  to "fofnexp")
- added new file type in manifest: ".fa". Behaves like ".fna"
- fix segfault when read access to /tmp directory is denied
- bugfix: merging of reads did sometimes not work as intended (introduced in
  3.9.2)
- bugfix: convert_project does not stop anymore on template problems
- bugfix: MAF output for projects with several strains now immediately
  convertable to SAM
- bugfix: FASTQ files in MS-DOS format led to parsing problems
- bugfix: possible endless loop in mapping projects with -AS:nop>1
- bugfix: miraconvert -n with an empty file selected all contigs/reads, now
  selects none (as one would expect)
- renaming of all major MIRA binaries to have "mira" as prefix. Most important
  change: "convert_project" is now "miraconvert"
- renamed and changed behaviour of all NAG_AND_WARN parameters. Now three
  options: stop, warn, no


dev3.9.18
---------
- improved lossless digital normalisation
- fixed bug leading to wrong acceptance of SW alignments of sequences on
  recognized repeat/SNP sites
- workaround for bug of Apple standard C libraries leading to wrong MIRA
  assemblies on OSX
- fixed bug which led MIRA 3.9.x to stop if no SCF files were found for Sanger
  data
- new parameters: -NW:sodnr:sote for more fine grained nagging/warning
  messages
- added patches from Debian Med


dev3.9.17
---------
- improvement: mira now increases stack size to allow for full length
  alignments of reads >= 15kb
- change: results in FASTA format now splitted into strain specific result
  files
- bugfix: cause for rare premature stop of assembly found and fixed
- bugfix: calculation of number of strains now corrrect
- bugfix: calculation of digital normalisation coverage estimates could be
  wrong for reads lateron determined to be chimeric


dev3.9.16
---------
- change of parameter parsing: "1" and "0" are not allowed anymore for boolean
  parameters. To make up for it, "y", "n", "t", "f" are now allowed shortcuts
- change of parameter parsing: -MI:somrnl:sonfs moved to new section
  -NAG_AND_WARN section: -NW:somrnl:sonfs
- change of parameter parsing: new section -HASHSTATISTICS. Moved several
  parameters from -SKIM to -HASHSTATISTICS
- new parameter: -CL:search_phix174:filter_phix174 to automatically search for
  and filter PhiX174 from Illumina data. Defaults: search only in genome mode,
  search and filter in EST mode
- new parameter -SK:fcem for EST/RNASeq assembly of very skewed distributions
- output: debris file in info directory now contains reason for read being
  pushed to debris
- new functionality: lossless digital normalisaton
- mira: slightly adapted contig naming scheming when using digital
  normalisation
- mira: computation of hash statistics a bit faster
- mira: hash statistics analysis now partly multithreaded
- mira: uses less memory for high-coverage genes in EST/RNASeq projects
- mira: now knows about Nextera adaptors
- mira: increased speed on very large projects with tens of millions of reads
- mira: improved output of parameters, now does not output "san" if sanger is
  not used.
- convert_project: new option -Q for default read quality. Also does not stop
  when missing quality files
- convert_project: new to-target "samnbb". In reference assemblies, leaves out
  the backbone (reference) sequence. Enhances compatibility for conversion to
  BAM.
- convert_project: SAM output now uses "255" as mapping quality instead of
  "50"
- convert_project: bugfix for correctly guessing to and from filetypes from
  filenames
- bugfix: annotations in GenBank files were sometimes garbled
- bugfix: parsing of -AS:mrpc did not work in the long version
- bugfix: repeat resolving of very deep repeats involving indels sometimes did
  not resolve repeats correctly
- bugfix: writing MAF checkpoint files for mapping assemblies
- bugfix: fixed "*sigh*" error in mapping assemblies which struck especially
  in deep mappings and when the reference was very different from mapped
  reads.
- bugfix: erroneous overcall editing in mapping assemblies
- potential bugfix for "would extend AS_skim_edges" error
- more readable error messages


dev3.9.15
---------
- further improvement of EST/RNASeq assembly: more probably reconstructs the
  most frequent splice variant as full transcript and less frequent variants
  as partial transcripts. Further reduction of intron assembly.
- bugfix of problem which led to abort of assembly process.


dev3.9.14
---------
- preprocessing and assembly algorithm improvement for MiSeq data
- improvement: avoidance of intron sequence improved for EST / RNASeq assembly
- improvement: adjustments to default parametrisation for EST / RNASeq
  assemblies lead to better results
- improvement: hybrid assembly of long and short reads (e.g. 454 & Illumina)
  now finds more contig joins
- mirabait: new output type "txt" for list of matching read names
- workaround: convert_project & mirabait now also work (like mira) when
  computer locale is broken.
- bugfix: dump of replay data did not work as intended for mapping
  assemblies.

dev3.9.13
---------
- ooops, forgot to include new file in source

dev3.9.12
---------
- fixed bug which led to stops of the assembly process
- fixed bug which led to segmentation faults in rare cases when mapping
  against a single, short backbone sequence.


dev3.9.11
---------
- several bugfixes in the assembly engine which previously led to an abort of
  the assembly process.
- changes in configure script to ensure better build compatibility on GenToo


dev3.9.10
---------
- improvement: major speed improvement for mapping projects with hybrid
  technologies where one of the technologies is Illumina
- improvement: preprocessing of read now fully multi-threaded
- improvement: now reads faulty GenBank files written by CloneManager
- convert_project: comfort functions. Parameters -f and -t now not mandatory
  anymore but deduced from given file names
- convert_project: now also understands .gbk as filetype for GenBank data
- new/old functionality: reactivated, MAF & CAF files can now be used as again
  as input in manifest files
- change: some warnings caused by internal errors now cause a full MIRA stop
  to hunt down the cause of the problems.
- change: in mapping assemblies, template_size and segment_placement now use
  "infoonly" as default if not given.
- change: MIRA version written to assembly info file
- several internal changes to loading system for future enhancements
- several internal code cleanups
- bugfix: mira -c did not work as described
- bugfix: some reads could sometimes not be correctly aligned, problem
  triggered by many indels in reads (Ion Torrent, 454) of high coverage
  contigs.
- bugfix: -highlyrepetitive in manifest parameters caused a parsing error
- documentation: added "default_qual" in documentation for manifest files

dev3.9.9
--------
- bugfix: MIRA sometimes stopped when building low coverage contigs


dev3.9.8
--------
- bugfix: on machines with heavy load, some files and directories sometimes
  failed to be deleted. Fixed.
- bugfix: mapping of Illumina at contig ends sometimes failed (since 3.9.6)
- bugfix: error causing "gaaah, had doubles" in de-novo assemblies of genomes
  fixed.
- small runtime improvement: some unnecessary steps in the overlap graph
  traversal are now dynamically left out.
- convert_project: new option -S (currently only for Illumina data)
- convert_project: reactivated -i
- unnecessary debug files are now not written anymore to the tmp directory
- new -MI:ef2 flag for switching off experimental code


dev3.9.7
--------
- bugfix: -MI:ef1 did not work


dev3.9.6
--------
- improvement: mapping assemblies with greatly improved alignments for
  non-perfect matches, especially in larger indel regions
- improvement: assembly of repetitive areas in de-novo assemblies, no more
  "repeat coverage peaks"
- improvement: parsing of sequence identifiers in NCBI gi format automatically
  chooses the simple sequence name as read name
- improvement: can now load GFF3 files without sequences (as provided by the
  NCBI)
- improvement: data loading consistency checks added to catch illegal quality
  values >100.
- convert_project: "-t stats" now also outputs consensus tags info file
- slight change in output of info file of consensus tags (added length of tag)
- change: output of wiggle files is now always for the full, padded contig
- change: output of alignments as text or HTML now has lowercase alignments,
  uppercase for bases which do not match the consensus
- documentation: using Illumina data from the SRA
- bugfix: possible segmentation faults removed
- bugfix: -SB:abnc=true led to a problem since 3.9.5, fixed.
- bugfix: MAF parsing got phase information wrong in GFF3 tags
- bugfix: CAF writing phase info from GFF3 tags was not ASCII
- bugfix: creation of SAM files failed when libraries with FR or RF
  orientation were present.
- bugfix: coverage equivalent reads were created too short for coverages >1.


dev3.9.5
--------
- better handling of failing disk operations when memory is tight
- improved assembly of repetitive regions in genome assemblies: faster and
  non-heuristic.
- improved assembly of EST / RNASeq of eukaryotic data: intronic sequences
  should be less frequent.
- major speed improvement in de-novo assemblies higher coverages of Illumina,
  Ion Torrent and 454 data
- better clipping of degenerated Ion Torrent adaptors
- new 454 adaptor clipped
- rename -CL:mkfr to -CL:pmkfr and added automatic adjustments for high
  coverage data.
- better error handling on OSX: user now always gets a specific MIRA error
  message.
- better semantic checking of manifest files (missing reference, no data,
  etc.)
- manifest files: allowing LEFTIES and RIGHTIES in segment placement
- the 3.9.x line now recognises if it was called with parameters from the
  3.4.x line and stops
- parameter -CL:lcc was split into -CL:lccf:lccb
- bugfix: parsing MAF files with Illumina and / or Ion data leads to less used
  memory like as if loaded anew (important for resumed assemblies).
- bugfix: FASTQ loader now stops on illegal FASTQ files
- bugfix: some alignments did don get propper flags, leading to suboptimal
  assembly graphs.


dev3.9.4
--------
- bugfix: FASTQ input in Illumina-64 format was broken
- bugfix & improvement: SAM format from convert_project now more compatible to
  SAM specification
- bugfix: bugs leading to error message trackingunused != countingunused
- bugfix: loading of gzip packed FASTQ with a name ending in .gz did not work
- first speed improvements in contig cleanup routines when mapping Ion data,
  more to come in next versions
- further improvements for binary compatibility to older Linux kernels, errors
  of BOOST induced problems with locale settings should now be a thing of the
  past
- minor updates to the documentation
- new ./configure option "--enable-native" for processor specific optimised
  code


dev3.9.3
--------
- fixed ./configure to produce a fully statically linked binary on OSX
- bugfix: CAF files were not read correctly (bug due to MIRA now also saving
  placement code in CAF)
- fixed segfault for some projects when doing "convert_project -t asnp"
- fix: building without EdIt broke on some systems
- mira now advertises the resume option in "mira -h"
- improvement: distributed binary packages now compatible again to older
  kernels (down to 2.6.15)
- improvement: saving of snapshots is now much more error resistant, a working
  snapshot should always be present on disk
- improvement: SNP and feature analysis now only writes "X" as amino acid if a
  SNP leads to a codon change which resolves to a different AA.
- improvement: quicker SNP and feature analysis in intergenic regions
- improvement: parsing of MAF files with lots of tags a lot faster (50% and
  more)
- change: *_info_featuresequences.txt and *_info_featureanalysis.txt now
  include an additional column "FType" (FeatureType)


dev3.9.2
--------
- fixed possible segfault (only hit when assembling de-novo with more than one
  strain)
- fixed parsing error in manifest files when loading data from pacbio
- improved configure script for compiling on a wider variety of platforms and
  with different versions of gcc and BOOST libraries
- MIRA resume: added "-r"
- binary package for Linux: fixed issue where older Linux kernels needed to
  have 'export LC_ALL=C' or else they would not start.
