Ruby, like Perl, supports the idea of a “here document” (heredoc). This gives you an easy way of writing large multiline strings that are started and terminated by an arbitrary string delimiter. Ruby heredocs also support string interpolation, much like a double-quoted string. Heredocs are useful for several reasons, not least when combined with metaprogramming features such as module_eval, where they let you embed Ruby source code in the file with normal formatting.
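As a minimal sketch of the module_eval use case described above (the Greeter module, the greet method and the name variable are hypothetical names chosen only for illustration), a heredoc can hold the source of a method that is interpolated and then evaluated:

```ruby
# Hypothetical module used only for this illustration.
module Greeter; end

name = "world"

# The heredoc body is interpolated like a double-quoted string, so
# #{name} is substituted before module_eval ever sees the source.
# The <<-EOS form allows the closing delimiter to be indented.
Greeter.module_eval <<-EOS
  def self.greet
    "Hello, #{name}!"
  end
EOS

puts Greeter.greet
```

Because the interpolation happens when the string is built, the evaluated method contains the literal text "Hello, world!" and no longer refers to the name variable.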
Ripper — A Ruby Parser
Ripper provides a SAX-like, event-based protocol for parsing Ruby. It divides the events into scanner events and parser events. Roughly speaking, the scanner events correspond to the tokens produced during lexical analysis, while the parser events correspond to the grammar rules recognized by the parser. This provides a fairly easy way to write a simple parser, and we are using Ripper as the basis for our Ruby parser, which builds a parse tree using our RubyWrite tool to generate the nodes. Unfortunately Ripper, which ships as an extension with Ruby 1.9.x, does not properly support heredocs in its current implementation.
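To make the two kinds of events concrete, here is a small sketch of a Ripper subclass that records which events fire. The EventLogger class and its two log attributes are hypothetical names for this example and are not part of RubyWrite; the on_* handler naming and the SCANNER_EVENTS and PARSER_EVENTS constants come from Ripper itself.

```ruby
require 'ripper'

# Hypothetical helper class: logs every scanner and parser event it receives.
class EventLogger < Ripper
  attr_reader :scanner_log, :parser_log

  def initialize(src)
    super
    @scanner_log = []
    @parser_log = []
  end

  # Scanner event handlers receive the token text.
  SCANNER_EVENTS.each do |event|
    define_method(:"on_#{event}") do |token|
      @scanner_log << event
      token
    end
  end

  # Parser event handlers receive the values returned by earlier handlers.
  PARSER_EVENTS.each do |event|
    define_method(:"on_#{event}") do |*args|
      @parser_log << event
      args
    end
  end
end

logger = EventLogger.new("x = 1 + 2")
logger.parse
puts "scanner events: #{logger.scanner_log.uniq.inspect}"
puts "parser events:  #{logger.parser_log.uniq.inspect}"
```

The scanner log should contain token-level events such as ident, op and int, while the parser log should contain rule-level events such as binary, assign and program. The exact set and order of events may differ between Ruby versions.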
Ripper and Heredocs
Ripper generates all the appropriate scanner events for the heredoc, but ultimately loses the last part of the string, just before the heredoc end event, because it overwrites the YACC value that contained the string before that value gets the chance to reach the parser. I think the problem partially stems from the fact that the lexer function treats heredocs specially, handling them outside the parser and only dipping into the parser to process embedded strings. The problem with this approach is that the parser is not aware of the heredoc_beg and heredoc_end events generated in the Ripper scanner. In the case of heredoc_beg this is fine, since the parser is just generating an event for the finished string, but it does mean that dispatching the heredoc_end event accidentally wipes out the string content token that the parser was actually expecting at that point.
The bug can be demonstrated pretty easily using either Ripper’s built-in s-expression parser or our RubyWriteRuby parser:
$ irb -Ilib -rruby_write_ruby
>> Ripper.sexp("<<-EOF\nThis is a\ntest of heredocs\nEOF")
=> [:program, [[:string_literal, [:string_content, [:heredoc_end, "EOF", [4, 0]]]]]]
>> RubyWriteRuby.parse("<<-EOF\nThis is a\ntest of heredocs\nEOF").to_string
=> ":Program[[:StringLiteral[["EOF"]]]]"
I wanted to find a simple fix for this rather than changing how heredocs are handled between the Ruby lexer and parser. I made a small change to the parse.y file that allows the heredoc_end event to be issued without overwriting the final YACC value, so that the lexer sends the correct information to the parser. I have added Bug 1921 to the Ruby redmine project with my patch. Hopefully this will be sufficient to patch the problem until a better fix can be written.
With the patch applied, the parser receives the expected tstring_content instead of the heredoc_end. This is done without changing the scanner events dispatched into Ripper, so it is still possible to capture the heredoc_beg, tstring_content, and heredoc_end scanner events in the correct order. Running the test with my patch:
$ irb -Ilib -rruby_write_ruby
>> Ripper.sexp("<<-EOF\nThis is a\ntest of heredocs\nEOF")
=> [:program, [[:string_literal, [:string_content,
[:tstring_content, "This is a\ntest of heredocs\n", [2, 0]]]]]]
>> RubyWriteRuby.parse("<<-EOF\nThis is a\ntest of heredocs\nEOF").to_string
=> ":Program[[:StringLiteral[["This is a\\ntest of heredocs\\n"]]]]"
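The order of the scanner events mentioned above can be checked independently of the parser with Ripper.lex, which returns the scanner tokens sorted by position. The following is only a sketch: heredoc_scanner_events is a hypothetical helper written for this example, and it assumes that each token is an array whose first three elements are position, event name and token text.

```ruby
require 'ripper'

# Hypothetical helper: returns [event, text] pairs for the three
# heredoc-related scanner events, in the order they appear in the source.
def heredoc_scanner_events(src)
  wanted = [:on_heredoc_beg, :on_tstring_content, :on_heredoc_end]
  Ripper.lex(src)
        .map { |_pos, event, text, *| [event, text] }
        .select { |event, _text| wanted.include?(event) }
end

source = "<<-EOF\nThis is a\ntest of heredocs\nEOF"
heredoc_scanner_events(source).each do |event, text|
  puts "#{event}: #{text.inspect}"
end
```

Ripper.lex prefixes event names with on_, so the events appear as on_heredoc_beg, on_tstring_content and on_heredoc_end. Since this only looks at scanner events, it does not by itself show whether the parser-side bug is present.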