Class: Markdown::Merge::Cleanse::CondensedLinkRefs
- Inherits: Object
  - Object
  - Markdown::Merge::Cleanse::CondensedLinkRefs
- Defined in: lib/markdown/merge/cleanse/condensed_link_refs.rb
Overview
Parslet-based parser for fixing condensed Markdown link reference definitions.
== The Problem
This class fixes corrupted Markdown files where link reference definitions
that were originally on separate lines got smashed together by having their
separating newlines removed.
A previous bug in ast-merge caused link reference definitions at the bottom
of Markdown files to be merged together into a single line without newlines
or whitespace between them.
== Corruption Patterns
Two types of corruption are detected and fixed:
- Multiple definitions condensed on one line:
  - Corrupted: [label1]: url1[label2]: url2
  - Fixed: Each definition on its own line
- Content followed by definition without newline:
  - Corrupted: Some text or URL[label]: url
  - Fixed: Newline inserted before [label]:
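To make the first pattern concrete, the snippet below reproduces the corruption by stripping the newlines from two well-formed definitions. The labels and URLs are placeholder data chosen for illustration, not taken from any real file.

```ruby
# Hypothetical sample data: two well-formed link reference definitions.
fixed = "[label1]: https://example.com/one\n[label2]: https://example.com/two\n"

# Simulate the corruption described above: the separating newlines are removed,
# so the second definition starts right after the first URL.
corrupted = fixed.delete("\n")

puts corrupted
```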
== How It Works
The parser uses a PEG grammar (via Parslet) to:
- Recognize link reference definition patterns: [label]: url
- Detect when multiple definitions are on the same line
- Detect when content precedes a definition without newline separation
- Parse and reconstruct definitions with proper newlines
Why PEG? The previous regex-based implementation had potential ReDoS
(Regular Expression Denial of Service) vulnerabilities due to complex
lookahead/lookbehind patterns. A PEG parser avoids the catastrophic regex
backtracking that makes ReDoS possible, and with memoization (packrat
parsing) it runs in time linear in the input length.
The grammar extends the pattern from LinkParser::DefinitionGrammar but
handles the case where definitions are concatenated without separators.
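The Parslet grammar itself is not reproduced on this page. As a rough illustration of the idea of finding `[label]:` markers in one left-to-right pass with no regular expressions, here is a plain-Ruby stand-in. The method name `definition_starts` is hypothetical, and this is a sketch rather than the class's actual grammar; it only looks for a non-empty bracketed label followed by a colon.

```ruby
# Hypothetical helper (not part of the library): returns the offsets at which
# a "[label]:" marker begins, visiting each character exactly once.
def definition_starts(line)
  chars = line.chars
  starts = []
  open_at = nil # offset of the most recent unmatched "["

  chars.each_with_index do |ch, i|
    if ch == "["
      open_at = i
    elsif ch == "]"
      # Require a non-empty label and a ":" right after the closing bracket.
      starts << open_at if open_at && i > open_at + 1 && chars[i + 1] == ":"
      open_at = nil
    end
  end

  starts
end

p definition_starts("[l1]: url1[l2]: url2") # two markers on one line
p definition_starts("text[label]: url")     # marker preceded by content
```

More than one offset, or a single offset other than zero with non-blank text before it, corresponds to the two corruption patterns described above.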
Defined Under Namespace
Classes: CondensedDefinitionsGrammar
Instance Attribute Summary
- #source ⇒ String (readonly)
  The input text to parse.
Instance Method Summary
- #condensed? ⇒ Boolean
  Check if the source contains condensed link reference definitions.
- #count ⇒ Integer
  Count the number of link reference definitions in the source.
- #definitions ⇒ Array<Hash>
  Parse the source into the individual link reference definitions that belong to condensed sequences.
- #expand ⇒ String
  Expand condensed link reference definitions to separate lines.
- #initialize(source) ⇒ CondensedLinkRefs (constructor)
  Create a new parser for the given text.
Constructor Details
#initialize(source) ⇒ CondensedLinkRefs
Create a new parser for the given text.
    # File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 163
    def initialize(source)
      @source = source.to_s
      @grammar = CondensedDefinitionsGrammar.new
      @parsed = nil
      @definitions = nil
    end
Instance Attribute Details
#source ⇒ String (readonly)
Returns the input text to parse.
    # File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 158
    def source
      @source
    end
Instance Method Details
#condensed? ⇒ Boolean
Check if the source contains condensed link reference definitions.
Detects patterns where link definitions are not properly separated:
- Multiple link defs on same line: [l1]: url1[l2]: url2
- Content followed by link def without newline: text[label]: url
Uses the PEG grammar to parse and detect condensed sequences.
    # File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 179
    def condensed?
      source.each_line do |line|
        # Pattern 1: Line contains 2+ link definitions (condensed together)
        return true if contains_multiple_definitions?(line)

        # Pattern 2: Line has content before first link definition
        # (indicates corruption where newline before def was removed)
        return true if has_content_before_definition?(line)
      end
      false
    end
#count ⇒ Integer
Count the number of link reference definitions in the source.
    # File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 249
    def count
      definitions.size
    end
#definitions ⇒ Array<Hash>
Parse the source into the individual link reference definitions that belong to condensed sequences.
This finds link refs that are part of corrupted patterns:
- Multiple refs on same line without newlines
- Content followed by ref without newline
Uses the PEG grammar to properly parse link definitions.
    # File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 200
    def definitions
      return @definitions if @definitions

      @definitions = []

      # Find all condensed sequences line by line
      source.each_line do |line|
        # Try to parse as definitions
        parsed = parse_line(line)
        next unless parsed && !parsed.empty?

        # Check if line has content before first definition
        first_bracket = line.index("[")
        has_prefix = first_bracket && first_bracket > 0 && !line[0...first_bracket].strip.empty?

        # Include if: multiple definitions OR single definition with prefix
        next unless parsed.size > 1 || has_prefix

        # Extract definition info from parse tree
        parsed.each do |def_tree|
          @definitions << extract_definition(def_tree)
        end
      end

      @definitions
    end
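The prefix check in the listing above is worth seeing in isolation, since it decides whether a line with a single definition counts as corrupted. The sketch below lifts that expression into a standalone method; the name `content_before_first_bracket?` is hypothetical, and the real method only applies this test to lines the grammar has already parsed.

```ruby
# Hypothetical standalone version of the has_prefix expression shown above.
# True only when non-whitespace text appears before the first "[".
def content_before_first_bracket?(line)
  first_bracket = line.index("[")
  return false unless first_bracket && first_bracket > 0

  !line[0...first_bracket].strip.empty?
end

p content_before_first_bracket?("text[label]: url")  # content precedes the definition
p content_before_first_bracket?("[label]: url")      # definition starts the line
p content_before_first_bracket?("   [label]: url")   # only indentation before it
```

Leading indentation alone does not count as a prefix, because the slice before the bracket is stripped before the emptiness test.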
#expand ⇒ String
Expand condensed link reference definitions to separate lines.
Fixes only the condensed patterns (where a URL is immediately followed
by a new link ref definition without a newline). All other content
is preserved exactly as-is.
Uses the PEG grammar to properly parse and reconstruct definitions.
    # File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 236
    def expand
      return source unless condensed?

      lines = source.lines.map do |line|
        # per-line helper call; the helper's name is missing from this listing
        (line)
      end

      lines.join
    end
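The per-line rewriting step is not shown on this page. One way to picture it, assuming the offsets where each definition begins are already known (for example from a parser), is to cut the line at every offset other than zero and rejoin the pieces with newlines. The method name `split_at_definitions` and the hard-coded offsets are assumptions for illustration only, not the library's implementation.

```ruby
# Hypothetical sketch: insert a newline before every definition that does not
# already start the line. `starts` holds the offsets of each "[label]:" marker.
def split_at_definitions(line, starts)
  cuts = starts.reject(&:zero?)
  return line if cuts.empty? # nothing condensed; keep the line exactly as-is

  pieces = []
  previous = 0
  cuts.each do |cut|
    pieces << line[previous...cut]
    previous = cut
  end
  pieces << line[previous..-1]
  pieces.join("\n")
end

# Offsets are supplied by hand here; in practice they would come from parsing.
source = "[l1]: url1[l2]: url2\n"
puts source.lines.map { |line| split_at_definitions(line, [0, 10]) }.join
```

Lines with no cut points are returned untouched, which matches the documented behaviour that all other content is preserved exactly as-is.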