Class: Markdown::Merge::Cleanse::CondensedLinkRefs

Inherits:
Object
Defined in:
lib/markdown/merge/cleanse/condensed_link_refs.rb

Overview

Parslet-based parser for fixing condensed Markdown link reference definitions.

== The Problem

This class repairs corrupted Markdown files in which link reference
definitions that were originally on separate lines have been run together
because the newlines between them were removed.

A previous bug in ast-merge caused link reference definitions at the bottom
of Markdown files to be merged together into a single line without newlines
or whitespace between them.

== Corruption Patterns

Two types of corruption are detected and fixed:

  1. Multiple definitions condensed on one line:
    • Corrupted: [label1]: url1[label2]: url2
    • Fixed: Each definition on its own line
  2. Content followed by definition without newline:
    • Corrupted: Some text or URL[label]: url
    • Fixed: Newline inserted before [label]:

== How It Works

The parser uses a PEG grammar (via Parslet) to:

  • Recognize link reference definition patterns: [label]: url
  • Detect when multiple definitions are on the same line
  • Detect when content precedes a definition without newline separation
  • Parse and reconstruct definitions with proper newlines
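
The steps above can be illustrated without Parslet. The following is a minimal sketch, not the library's grammar: `marker_offsets` and `expand_line_sketch` are hypothetical names, and the hand-written scanner assumes that any non-empty `[...]` immediately followed by `:` starts a definition, which is a simplification of real link reference syntax.

```ruby
# Hypothetical stand-in for the Parslet grammar: a single left-to-right pass
# that records where each "[label]:" marker begins in a line.
def marker_offsets(line)
  chars = line.chars
  offsets = []
  open = nil
  chars.each_with_index do |ch, i|
    case ch
    when "["
      open = i # remember the most recent opening bracket
    when "]"
      # A marker is a non-empty label closed by "]" and followed by ":".
      offsets << open if open && i > open + 1 && chars[i + 1] == ":"
      open = nil
    end
  end
  offsets
end

# Insert a newline before every marker that is preceded by non-blank content.
# Works from the last marker backwards so earlier offsets stay valid.
def expand_line_sketch(line)
  chars = line.chars
  marker_offsets(line).reverse_each do |offset|
    next if chars[0...offset].join.strip.empty? # marker already starts the line
    chars.insert(offset, "\n")
  end
  chars.join
end

puts expand_line_sketch("[a]: https://example.com/a.svg[b]: https://example.com")
puts expand_line_sketch("https://donate.codeberg.org/[c]: CONTRIBUTING.md")
```

Both calls print two lines each: the first separates the two definitions, the second moves `[c]: CONTRIBUTING.md` onto its own line. A line that already starts with its only definition is returned unchanged.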

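Why PEG? The previous regex-based implementation was potentially vulnerable
to ReDoS (Regular Expression Denial of Service) because of its complex
lookahead/lookbehind patterns. A PEG parser does not rely on regex
backtracking, so it avoids that class of attack; with memoization (packrat
parsing), PEG parse time is linear in the input length.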

The grammar extends the pattern from LinkParser::DefinitionGrammar but
handles the case where definitions are concatenated without separators.

Examples:

Condensed definitions (Pattern 1)

# Before (corrupted):
"[⛳liberapay-img]: https://example.com/img.svg[⛳liberapay]: https://example.com"

# After (fixed):
"[⛳liberapay-img]: https://example.com/img.svg\n[⛳liberapay]: https://example.com"

Content before definition (Pattern 2)

# Before (corrupted):
"https://donate.codeberg.org/[🤝contributing]: CONTRIBUTING.md"

# After (fixed):
"https://donate.codeberg.org/\n[🤝contributing]: CONTRIBUTING.md"

Basic usage

parser = Markdown::Merge::Cleanse::CondensedLinkRefs.new(condensed_text)
fixed_text = parser.expand

Check if text contains condensed refs

parser = Markdown::Merge::Cleanse::CondensedLinkRefs.new(text)
parser.condensed? # => true/false

Process a file

content = File.read("README.md")
parser = Markdown::Merge::Cleanse::CondensedLinkRefs.new(content)
if parser.condensed?
  File.write("README.md", parser.expand)
end

Get parsed definitions

parser = Markdown::Merge::Cleanse::CondensedLinkRefs.new(condensed_text)
parser.definitions.each do |defn|
  puts "#{defn[:label]} => #{defn[:url]}"
end

Defined Under Namespace

Classes: CondensedDefinitionsGrammar

Constructor Details

#initialize(source) ⇒ CondensedLinkRefs

Create a new parser for the given text.

Parameters:

  • source (String)

    the text that may contain condensed link refs



# File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 163

def initialize(source)
  @source = source.to_s
  @grammar = CondensedDefinitionsGrammar.new
  @parsed = nil
  @definitions = nil
end

Instance Attribute Details

#source ⇒ String (readonly)

Returns the input text to parse.

Returns:

  • (String)

    the input text to parse



# File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 158

def source
  @source
end

Instance Method Details

#condensed? ⇒ Boolean

Check if the source contains condensed link reference definitions.

Detects patterns where link definitions are not properly separated:

  1. Multiple link defs on same line: [l1]: url1[l2]: url2
  2. Content followed by link def without newline: text[label]: url

Uses the PEG grammar to parse and detect condensed sequences.

Returns:

  • (Boolean)

    true if condensed refs are detected



# File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 179

def condensed?
  source.each_line do |line|
    # Pattern 1: Line contains 2+ link definitions (condensed together)
    return true if contains_multiple_definitions?(line)

    # Pattern 2: Line has content before first link definition
    # (indicates corruption where newline before def was removed)
    return true if has_content_before_definition?(line)
  end
  false
end
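
The two private predicates called above are not shown on this page. As a rough sketch of what they need to decide, the stand-ins below use one simple pattern with no lookaround or nested quantifiers in place of the Parslet grammar. All names ending in `_sketch?` and the `MARKER` constant are hypothetical and are not part of the library.

```ruby
# Hypothetical marker pattern: "[", a non-empty label without brackets, "]:".
MARKER = /\[[^\[\]\n]+\]:/

# Stand-in for contains_multiple_definitions?: two or more markers on a line.
def multiple_definitions_sketch?(line)
  line.scan(MARKER).size > 1
end

# Stand-in for has_content_before_definition?: non-blank text before the
# first marker on the line.
def content_before_definition_sketch?(line)
  first = line.index(MARKER)
  !first.nil? && !line[0...first].strip.empty?
end

# Same shape as #condensed?: stop at the first line that matches either pattern.
def condensed_sketch?(text)
  text.each_line.any? do |line|
    multiple_definitions_sketch?(line) || content_before_definition_sketch?(line)
  end
end

p condensed_sketch?("[a]: u1[b]: u2\n")      # => true  (Pattern 1)
p condensed_sketch?("https://x/[c]: C.md\n") # => true  (Pattern 2)
p condensed_sketch?("[a]: u1\n[b]: u2\n")    # => false (already separated)
```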

#count ⇒ Integer

Count the number of link reference definitions in the source.

Returns:

  • (Integer)

    number of link ref definitions found



# File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 249

def count
  definitions.size
end

#definitions ⇒ Array<Hash>

Parse the source and return only the link reference definitions that are part of a condensed sequence.

This finds link refs that are part of corrupted patterns:

  1. Multiple refs on same line without newlines
  2. Content followed by ref without newline

Uses the PEG grammar to properly parse link definitions.

Returns:

  • (Array<Hash>)

    Array of { label:, url:, title: (optional) }



# File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 200

def definitions
  return @definitions if @definitions

  @definitions = []

  # Find all condensed sequences line by line
  source.each_line do |line|
    # Try to parse as definitions
    parsed = parse_line(line)
    next unless parsed && !parsed.empty?

    # Check if line has content before first definition
    first_bracket = line.index("[")
    has_prefix = first_bracket && first_bracket > 0 && !line[0...first_bracket].strip.empty?

    # Include if: multiple definitions OR single definition with prefix
    next unless parsed.size > 1 || has_prefix

    # Extract definition info from parse tree
    parsed.each do |def_tree|
      @definitions << extract_definition(def_tree)
    end
  end

  @definitions
end
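
The prefix test in the listing above can be tried in isolation. `prefix_before_first_bracket?` is a hypothetical wrapper around the same check, written here only to show which lines count as having content before the first `[`.

```ruby
# Hypothetical wrapper around the prefix check used in #definitions.
def prefix_before_first_bracket?(line)
  first_bracket = line.index("[")
  # No bracket at all, or a bracket in column 0, means there is no prefix.
  return false unless first_bracket && first_bracket > 0

  # Leading whitespace alone does not count as content.
  !line[0...first_bracket].strip.empty?
end

p prefix_before_first_bracket?("[a]: url")                     # => false
p prefix_before_first_bracket?("  [a]: url")                   # => false
p prefix_before_first_bracket?("https://example.com/[a]: url") # => true
p prefix_before_first_bracket?("no brackets here")             # => false
```

Because the check looks only at the first `[` on the line, it is combined with the parse result in `#definitions`: a line is included when it parsed into more than one definition or when this prefix test is true.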

#expand ⇒ String

Expand condensed link reference definitions to separate lines.

Fixes only the condensed patterns (where a URL is immediately followed
by a new link ref definition without a newline). All other content
is preserved exactly as-is.

Uses the PEG grammar to properly parse and reconstruct definitions.

Returns:

  • (String)

    the source with condensed link refs expanded to separate lines



# File 'lib/markdown/merge/cleanse/condensed_link_refs.rb', line 236

def expand
  return source unless condensed?

  lines = source.lines.map do |line|
    expand_line(line)
  end

  lines.join
end
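
The "preserved exactly as-is" guarantee rests on a property of `String#lines`: each piece keeps its own line terminator, so mapping over the lines and joining them reproduces the input wherever the per-line transform returns its argument unchanged. The short sketch below demonstrates this with a hypothetical identity lambda standing in for `expand_line`; the sample text is made up for illustration.

```ruby
source = "Intro\r\n\n[a]: https://example.com/a\n[b]: https://example.com/b"

# String#lines keeps every line terminator, including "\r\n" and a missing
# final newline, so joining the pieces restores the original string.
pieces = source.lines
p pieces.size           # => 4
p pieces.join == source # => true

# Hypothetical stand-in for expand_line that leaves every line untouched.
identity = ->(line) { line }
p source.lines.map(&identity).join == source # => true
```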