A (supercharged) port of the Monaco Editor's Monarch syntax highlighter to CodeMirror 6.
Install it with `npm install cm6-monarch`.

`cm6-monarch` extends Monarch with parser syntax tree additions and the usage of sticky regexes in the backend, allowing for full lookbehind support.
See the `docs/syntax.md` file. There are some significant additions, although basically only non-breaking ones, like lookbehind support.
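To illustrate why sticky regexes matter for lookbehind, here is a minimal sketch. It is not taken from the `cm6-monarch` source: `matchAt` is a hypothetical helper, and the sample input is made up. The point is only that a sticky match runs against the full string at a given offset, so a lookbehind can still see the characters before that offset, whereas matching against a sliced string cannot.

```typescript
// Hypothetical helper (an assumption for illustration, not part of cm6-monarch's API).
// Matches `pattern` against `input` starting exactly at `pos`, without slicing the string.
function matchAt(pattern: RegExp, input: string, pos: number): string | null {
  // The sticky flag ('y') forces the match to begin at `lastIndex`.
  const flags = pattern.flags.includes('y') ? pattern.flags : pattern.flags + 'y'
  const sticky = new RegExp(pattern.source, flags)
  sticky.lastIndex = pos
  const match = sticky.exec(input)
  return match ? match[0] : null
}

const input = 'let $price = 10'

// Matches lowercase letters only when they are preceded by '$'.
const afterDollar = /(?<=\$)[a-z]+/

// Slicing throws away the '$' at index 4, so the lookbehind has nothing to look at.
const slicedResult = /^(?<=\$)[a-z]+/.exec(input.slice(5))

// The sticky match starts at index 5 but still sees the '$' at index 4.
const stickyResult = matchAt(afterDollar, input, 5)
```

With this input, the sliced match fails while the sticky match succeeds, which is the behavior a tokenizer needs if its rules are allowed to use lookbehind.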
See the `docs/syntax.md` file (at the bottom) to learn more. Nesting is done with parser directives, meaning the parser actually creates a syntax tree from the directives you write. You can use this tree to add code folding and all kinds of other syntax-node-related features.
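As a rough mental model of how open/close directives can turn a flat token stream into a tree, consider the sketch below. Everything in it is an assumption made for illustration: the `Token` and `TreeNode` shapes, the `open`/`close` fields, and `buildTree` are hypothetical and do not mirror the actual directive format, which is documented in `docs/syntax.md`.

```typescript
// Hypothetical token shape: `open` and `close` stand in for parser directives.
interface Token {
  type: string
  from: number
  to: number
  open?: string   // name of the node this token opens
  close?: string  // name of the node this token closes
}

interface TreeNode {
  name: string
  from: number
  to: number
  children: TreeNode[]
}

// Builds a nested tree from a flat token list using a stack of open nodes.
function buildTree(tokens: Token[], length: number): TreeNode {
  const root: TreeNode = { name: 'Document', from: 0, to: length, children: [] }
  const stack: TreeNode[] = [root]
  for (const token of tokens) {
    let parent = stack[stack.length - 1]
    if (token.open) {
      // An unclosed node extends to the end of the input until a close arrives.
      const node: TreeNode = { name: token.open, from: token.from, to: length, children: [] }
      parent.children.push(node)
      stack.push(node)
      parent = node
    }
    // Every token becomes a leaf of the innermost open node.
    parent.children.push({ name: token.type, from: token.from, to: token.to, children: [] })
    if (token.close && stack.length > 1 && parent.name === token.close) {
      parent.to = token.to
      stack.pop()
    }
  }
  return root
}

// Tokens for the input "{ a }".
const tree = buildTree([
  { type: 'bracket', from: 0, to: 1, open: 'Block' },
  { type: 'identifier', from: 2, to: 3 },
  { type: 'bracket', from: 4, to: 5, close: 'Block' }
], 5)
```

A node like `Block` here, with a known `from` and `to`, is the kind of range a folding feature can collapse.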
```ts
import { createMonarchLanguage } from 'cm6-monarch'

// Just for reference, this is the configuration interface.
interface MonarchLanguageDefinition {
  /** The name of the language. This is actually important for CodeMirror, so make sure it's correct. */
  name: string
  /** The Monarch lexer that will be used to tokenize the language. */
  lexer: IMonarchLanguage
  /** A list of `LanguageDescription` objects that will be used when the parser nests in a language. */
  nestLanguages?: LanguageDescription[]
  /** Configuration options for the parser, such as node props. */
  configure?: MonarchConfigure
  /** A list of aliases for the name of the language. (e.g. 'go' -> ['golang']) */
  alias?: string[]
  /** A list of file extensions. (e.g. ['.ts']) */
  ext?: string[]
  /** The 'languageData' field for the language. CodeMirror plugins use this data to interact with the language. */
  languageData?: { [name: string]: any }
  /** Extra extensions to be loaded. */
  extraExtensions?: Extension[]
}

// And what it returns:
interface MonarchLanguageData {
  /** Creates a `LanguageSupport` object that can be used like an ordinary language/extension. */
  load(): LanguageSupport
  /** A list of new `Tag` objects generated automatically from the language definition. */
  tags: { [name: string]: Tag }
  /** A `LanguageDescription` object, commonly used for nesting languages. */
  description: LanguageDescription
}

// This uses the lexer given in the Monarch tutorial.
// const { load, tags, description } = createMonarchLanguage({
const myLanguage = createMonarchLanguage({
  name: 'myLang',
  lexer: {
    // defaultToken: 'invalid',

    keywords: [
      'abstract', 'continue', 'for', 'new', 'switch', 'assert', 'goto', 'do',
      'if', 'private', 'this', 'break', 'protected', 'throw', 'else', 'public',
      'enum', 'return', 'catch', 'try', 'interface', 'static', 'class',
      'finally', 'const', 'super', 'while', 'true', 'false'
    ],

    typeKeywords: [
      'boolean', 'double', 'byte', 'int', 'short', 'char', 'void', 'long', 'float'
    ],

    operators: [
      '=', '>', '<', '!', '~', '?', ':', '==', '<=', '>=', '!=',
      '&&', '||', '++', '--', '+', '-', '*', '/', '&', '|', '^', '%',
      '<<', '>>', '>>>', '+=', '-=', '*=', '/=', '&=', '|=', '^=',
      '%=', '<<=', '>>=', '>>>='
    ],

    // Common regular expressions, as given in the Monarch tutorial.
    symbols: /[=><!~?:&|+\-*\/\^%]+/,
    escapes: /\\(?:[abfnrtv\\"']|x[0-9A-Fa-f]{1,4}|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})/,

    tokenizer: {
      root: [
        // identifiers and keywords
        [/[a-z_$][\w$]*/, {
          cases: {
            '@typeKeywords': 'keyword',
            '@keywords': 'keyword',
            '@default': 'identifier'
          }
        }],
        [/[A-Z][\w\$]*/, 'type.identifier'], // to show class names nicely

        // whitespace
        { include: '@whitespace' },

        // delimiters and operators
        [/[{}()\[\]]/, '@brackets'],
        [/[<>](?!@symbols)/, '@brackets'],
        [/@symbols/, {
          cases: {
            '@operators': 'operator',
            '@default': ''
          }
        }],

        // @ annotations.
        // As an example, we emit a debugging log message on these tokens.
        // Note: messages are suppressed during the first load -- change some lines to see them.
        [/@\s*[a-zA-Z_\$][\w\$]*/, { token: 'annotation', log: 'annotation token: $0' }],

        // numbers
        [/\d*\.\d+([eE][\-+]?\d+)?/, 'number.float'],
        [/0[xX][0-9a-fA-F]+/, 'number.hex'],
        [/\d+/, 'number'],

        // delimiter: after number because of .\d floats
        [/[;,.]/, 'delimiter'],

        // strings
        [/"([^"\\]|\\.)*$/, 'string.invalid'], // non-terminated string
        [/"/, { token: 'string.quote', bracket: '@open', next: '@string' }],

        // characters
        [/'[^\\']'/, 'string'],
        [/(')(@escapes)(')/, ['string', 'string.escape', 'string']],
        [/'/, 'string.invalid']
      ],

      comment: [
        [/[^\/*]+/, 'comment'],
        [/\/\*/, 'comment', '@push'], // nested comment
        ["\\*/", 'comment', '@pop'],
        [/[\/*]/, 'comment']
      ],

      string: [
        [/[^\\"]+/, 'string'],
        [/@escapes/, 'string.escape'],
        [/\\./, 'string.escape.invalid'],
        [/"/, { token: 'string.quote', bracket: '@close', next: '@pop' }]
      ],

      whitespace: [
        [/[ \t\r\n]+/, 'white'],
        [/\/\*/, 'comment', '@comment'],
        [/\/\/.*$/, 'comment'],
      ],
    }
  }
})

// And this is how you would load it in the editor:
import {EditorState} from "@codemirror/state"
import {EditorView, keymap} from "@codemirror/view"
import {defaultKeymap} from "@codemirror/commands"
import {defaultHighlightStyle} from '@codemirror/highlight'

const startState = EditorState.create({
  doc: "Hello World",
  extensions: [
    keymap.of(defaultKeymap),
    defaultHighlightStyle,
    myLanguage.load()
  ]
})

const view = new EditorView({
  state: startState,
  parent: document.body
})

// Note that `description` is also exported by the creation function.
// The `description` object is a `LanguageDescription`,
// which is most commonly used to load nested grammars.
// The `lang-markdown` language supports these, so just to show how that works:
import { markdown } from '@codemirror/lang-markdown'

const myBetterStartState = EditorState.create({
  doc: "Hello World",
  extensions: [
    keymap.of(defaultKeymap),
    defaultHighlightStyle,
    markdown({ codeLanguages: [myLanguage.description] })
  ]
})

// You could of course do this the other way around,
// and nest `LanguageDescription`s in the language you created.
```
---
Why?
----
God, I ask myself that every day.
Anyways, it's because Monarch uses _regex_, and this particular version of it supports stupidly flexible regex. If you're a regex wizard, you'll like this.
To be more specific about _why_, we'll start with a comparison to Lezer, which is the parser CodeMirror 6 normally uses. Lezer is a proper parser, capable of outputting beautiful syntax trees with stunningly simple grammar definitions. Seriously - go look at the official Lezer grammar for `json` files - it's really tiny and simple to understand. This parser can't do that.