import { last, createCountingIdGenerator, ArrayIterator } from '../util'

const WS_LEFT = /^\s+/g
const WS_LEFT_ALL = /^\s*/g
const WS_RIGHT = /\s+$/g
const WS_ALL = /\s+/g
// var ALL_WS_NOTSPACE_LEFT = /^[\t\n]+/g
// var ALL_WS_NOTSPACE_RIGHT = /[\t\n]+$/g
const SPACE = " "
const TABS_OR_NL = /[\t\n\r]+/g

const INVISIBLE_CHARACTER = "\u200B"

/**
  A generic base implementation for XML/HTML importers.

  @param {Object} config
  @param {DocumentSchema} config.schema
  @param {object[]} config.converters
 */
export default class DOMImporter {

  constructor(config, context) {
    this.context = context || {}

    if (!config.schema) {
      throw new Error('"config.schema" is mandatory')
    }
    if (!config.converters) {
      throw new Error('"config.converters" is mandatory')
    }

    this.config = Object.assign({ idAttribute: 'id' }, config)
    this.schema = config.schema
    this.converters = config.converters
    this.state = null

    this._defaultBlockConverter = null
    this._allConverters = []
    this._blockConverters = []
    this._propertyAnnotationConverters = []

    this.state = new DOMImporter.State()

    this._initialize()
  }

  /*
    Goes through all converters, checks their consistency
    and registers them in different sets depending on their type.
  */
  _initialize() {
    const schema = this.schema
    const defaultTextType = schema.getDefaultTextType()
    const converters = this.converters
    for (let i = 0; i < converters.length; i++) {
      let converter
      // a converter may be given as a class or as an instance
      if (typeof converters[i] === 'function') {
        const Converter = converters[i]
        converter = new Converter()
      } else {
        converter = converters[i]
      }
      if (!converter.type) {
        throw new Error('Converter must provide the type of the associated node.')
      }
      if (!converter.matchElement && !converter.tagName) {
        throw new Error('Converter must provide a matchElement function or a tagName property.')
      }
      if (!converter.matchElement) {
        converter.matchElement = this._defaultElementMatcher.bind(converter)
      }
      const NodeClass = schema.getNodeClass(converter.type)
      if (!NodeClass) {
        throw new Error('No node type defined for converter')
      }
      if (!this._defaultBlockConverter && defaultTextType === converter.type) {
        this._defaultBlockConverter = converter
      }
      this._allConverters.push(converter)
      // Defaults to _blockConverters
      if (NodeClass.prototype._isPropertyAnnotation) {
        this._propertyAnnotationConverters.push(converter)
      } else {
        this._blockConverters.push(converter)
      }
    }
    if (!this._defaultBlockConverter) {
      throw new Error(`No converter for defaultTextType ${defaultTextType}`)
    }
  }

  dispose() {
    if (this.state.doc) {
      this.state.doc.dispose()
    }
  }

  /**
    Resets this importer.

    Make sure to either create a new importer instance or call this method
    when you want to generate nodes belonging to different documents.
  */
  reset() {
    if (this.state.doc) {
      this.state.doc.dispose()
    }
    this.state.reset()
    this.state.doc = this._createDocument()
  }

  getDocument() {
    return this.state.doc
  }

  /**
    Converts all children of a given element and creates a Container node.

    @param {DOMElement[]} elements All elements that should be converted into the container.
    @param {String} containerId The id of the target container node.
    @returns {Container} the container node
   */
  convertContainer(elements, containerId) {
    if (!this.state.doc) this.reset()
    const state = this.state
    const iterator = new ArrayIterator(elements)
    const nodeIds = []
    while(iterator.hasNext()) {
      const el = iterator.next()
      let node
      const blockTypeConverter = this._getConverterForElement(el, 'block')
      if (blockTypeConverter) {
        state.pushContext(el.tagName, blockTypeConverter)
        let nodeData = this._createNodeData(el, blockTypeConverter.type)
        nodeData = blockTypeConverter.import(el, nodeData, this) || nodeData
        node = this._createNode(nodeData)
        let context = state.popContext()
        context.annos.forEach((a) => {
          this._createNode(a)
        })
      } else if (el.isCommentNode()) {
        continue
      } else {
        // skip empty text nodes
        if (el.isTextNode() && /^\s*$/.exec(el.textContent)) continue
        // If we find text nodes on the block level we wrap them
        // into a paragraph element (or whatever is configured as the default block level element)
        iterator.back()
        node = this._wrapInlineElementsIntoBlockElement(iterator)
      }
      if (node) {
        nodeIds.push(node.id)
      }
    }
    return this._createNode({
      type: 'container',
      id: containerId,
      nodes: nodeIds
    })
  }

  /**
    Converts a single HTML element and creates a node in the current document.

    @param {ui/DOMElement} el the HTML element
    @returns {object} the created node as JSON
   */
  convertElement(el) {
    if (!this.state.doc) this.reset()
    let isTopLevel = !this.state.isConverting
    if (isTopLevel) {
      this.state.isConverting = true
    }

    let nodeData, annos
    const converter = this._getConverterForElement(el)
    if (converter) {
      const NodeClass = this.schema.getNodeClass(converter.type)
      nodeData = this._createNodeData(el, converter.type)
      this.state.pushContext(el.tagName, converter)
      // Note: special treatment for property annotations and inline nodes,
      // i.e. if someone calls `importer.convertElement(annoEl)`.
      // Usually, annotations are imported in the course of `importer.annotatedText(..)`.
      // The peculiarity here is that in such a case it is not clear
      // which property the annotation is attached to.
      if (NodeClass.isInline) {
        nodeData = this._convertInlineNode(el, nodeData, converter)
      } else if (NodeClass.prototype._isPropertyAnnotation) {
        nodeData = this._convertPropertyAnnotation(el, nodeData)
      } else {
        nodeData = converter.import(el, nodeData, this) || nodeData
      }
      let context = this.state.popContext()
      annos = context.annos
    } else {
      throw new Error('No converter found for '+el.tagName)
    }
    // create the node
    const node = this._createNode(nodeData)
    // and all annos which have been created during this call
    annos.forEach((a) => {
      this._createNode(a)
    })

    // HACK: to allow using an importer stand-alone
    // i.e. creating detached elements
    if (this.config["stand-alone"] && isTopLevel) {
      this.state.isConverting = false
      this.reset()
    }
    return node
  }

  /**
    Convert annotated text. You should call this method only for elements
    containing rich-text.

    @param {DOMElement} el
    @param {String[]} path The target property where the extracted text (plus annotations) should be stored.
    @param {Object} options
    @param {Boolean} options.preserveWhitespace when true will preserve whitespace. Default: false.
    @returns {String} The converted text as plain-text

    @example

    ```
    p.content = converter.annotatedText(pEl, [p.id, 'content'])
    ```
   */
  annotatedText(el, path, options={}) {
    if (!path) {
      throw new Error('path is mandatory')
    }
    const state = this.state
    const context = last(state.contexts)
    // NOTE: this API is meant for node converters, which have been triggered
    // via convertElement().
    if (!context) {
      throw new Error('This should be called from within an element converter.')
    }
    // TODO: are there more options?
    const oldPreserveWhitespace = state.preserveWhitespace
    if (options.preserveWhitespace) {
      state.preserveWhitespace = true
    }
    state.stack.push({ path: path, offset: 0, text: "", annos: []})
    // IMO we should reset the last char, as it is only relevant within one
    // annotated text property. This feature is mainly used to eat up
    // whitespace in XML/HTML at tag boundaries, produced by pretty-printed XML/HTML.
    this.state.lastChar = ''
    const iterator = el.getChildNodeIterator()
    const text = this._annotatedText(iterator)
    // now we can collect all annotations which have been created during this
    // call of annotatedText
    const top = state.stack.pop()
    context.annos = context.annos.concat(top.annos)

    // reset state
    state.preserveWhitespace = oldPreserveWhitespace

    return text
  }

  /**
    Converts the given element as plain-text.

    @param {ui/DOMElement} el
    @returns {String} The plain text
   */
  plainText(el) {
    var state = this.state
    var text = el.textContent
    if (state.stack.length > 0) {
      var context = last(state.stack)
      context.offset += text.length
      // append the text to the current stack frame
      context.text = context.text.concat(text)
    }
    return text
  }

  /*
    Tells the converter to insert custom text.

    During conversion of annotatedText this is used to insert different text
    than taken from the DOM. E.g., for inline nodes we insert an invisible character
    instead of the inner content.

    @private
    @param {String}
   */
  _customText(text) {
    var state = this.state
    if (state.stack.length > 0) {
      var context = last(state.stack)
      context.offset += text.length
      // append the text to the current stack frame
      context.text = context.text.concat(text)
    }
    return text
  }

  /**
    Generates an id. The generated id is unique with respect to all ids generated so far.

    @param {String} prefix a prefix
    @return {String} the generated id
   */
  nextId(prefix) {
    // TODO: we could create more beautiful ids?
    // however we would need to be careful as there might be another
    // element in the HTML coming with that id
    // For now we use shas
    return this.state.uuid(prefix)
  }

  _getNextId(dom, type) {
    let id = this.nextId(type)
    // skip ids which are already taken, either by this importer or by the DOM
    while (this.state.ids[id] || dom.find('#'+id)) {
      id = this.nextId(type)
    }
    return id
  }

  _getIdForElement(el, type) {
    let id = el.getAttribute(this.config.idAttribute)
    if (id && !this.state.ids[id]) return id
    return this._getNextId(el.getOwnerDocument(), type)
  }

  // Note: this is e.g. shared by ClipboardImporter which has a different
  // implementation of this.createDocument()
  _createDocument() {
    // create an empty document and initialize the container if not present
    const schema = this.config.schema
    const DocumentClass = schema.getDocumentClass()
    return new DocumentClass(schema)
  }

  _convertPropertyAnnotation(el, nodeData) {
    const path = [nodeData.id, '_content']
    // if there is no context, this is called stand-alone
    // i.e., user tries to convert an annotation element
    // directly, not part of a block element, such as a paragraph
    nodeData._content = this.annotatedText(el, path)
    nodeData.start = { path, offset: 0 }
    nodeData.end = { offset: nodeData._content.length }
    return nodeData
  }

  _convertInlineNode(el, nodeData, converter) {
    const path = [nodeData.id, '_content']
    if (converter.import) {
      nodeData = converter.import(el, nodeData, this) || nodeData
    }
    nodeData._content = '$'
    nodeData.start = { path, offset: 0 }
    nodeData.end = { offset: 1 }
    return nodeData
  }

  _createNodeData(el, type) {
    if (!type) {
      throw new Error('type is mandatory.')
    }
    let nodeData = {
      type,
      id: this._getIdForElement(el, type)
    }
    this.state.ids[nodeData.id] = true
    return nodeData
  }

  _createNode(nodeData) {
    let doc = this.state.doc
    // NOTE: if your Document implementation adds default nodes in the constructor
    // and you have exported the node, we need to remove the default version first
    // TODO: alternatively we could just update the existing one. For now we remove the old one.
    let node = doc.get(nodeData.id)
    if (node) {
      // console.warn('Node with same id already exists.', node)
      doc.delete(node.id)
    }
    return doc.create(nodeData)
  }

  _defaultElementMatcher(el) {
    return el.is(this.tagName)
  }

  /*
    Internal function for parsing annotated text
  */
  _annotatedText(iterator) {
    const state = this.state
    const context = last(state.stack)
    /* istanbul ignore next */
    if (!context) {
      throw new Error('Illegal state: context is null.')
    }
    while(iterator.hasNext()) {
      var el = iterator.next()
      var text = ""
      /* istanbul ignore else */
      // Plain text nodes...
      if (el.isTextNode()) {
        text = this._prepareText(el.textContent)
        if (text.length) {
          // Note: text is not merged into the reentrant state
          // so that we are able to return for this reentrant call
          context.text = context.text.concat(text)
          context.offset += text.length
        }
      } else if (el.isCommentNode()) {
        // skip comment nodes
        continue
      } else if (el.isElementNode()) {
        const annoConverter = this._getConverterForElement(el, 'inline')
        // if no inline converter is found we just traverse deeper
        if (!annoConverter) {
          /* istanbul ignore next */
          if (!this.IGNORE_DEFAULT_WARNINGS) {
            console.warn('Unsupported inline element. We will not create an annotation for it, but process its children to extract annotated text.', el.outerHTML)
          }
          // this descends into child elements without introducing a new stack frame
          // and without creating an element.
          const iterator = el.getChildNodeIterator()
          this._annotatedText(iterator)
          continue
        }
        // reentrant: we delegate the conversion to the inline node class
        // it will either call us back (this.annotatedText) or give us a finished
        // node instantly (self-managed)
        var startOffset = context.offset
        const annoType = annoConverter.type
        const AnnoClass = this.schema.getNodeClass(annoType)
        let annoData = this._createNodeData(el, annoType)
        // push a new context so we can deal with reentrant calls
        let stackFrame = {
          path: context.path,
          offset: startOffset,
          text: "",
          annos: []
        }
        state.stack.push(stackFrame)
        // with custom import
        if (annoConverter.import) {
          state.pushContext(el.tagName, annoConverter)
          annoData = annoConverter.import(el, annoData, this) || annoData
          state.popContext()
        }
        // As opposed to earlier implementations we do not let custom
        // implementations convert the content, as they do not own the content
        // TODO: we should make sure to throw when the user tries to
        if (AnnoClass.isInline) {
          this._customText(INVISIBLE_CHARACTER)
        } else {
          // We call this to descend into the element
          // which could be 'forgotten' otherwise.
          // TODO: what if the converter has processed the element already?
          const iterator = el.getChildNodeIterator()
          this._annotatedText(iterator)
        }
        // ... and transfer the result into the current context
        state.stack.pop()
        context.offset = stackFrame.offset
        context.text = context.text.concat(stackFrame.text)
        // in the meantime the offset will probably have changed due to reentrant calls
        const endOffset = context.offset
        annoData.start = {
          path: context.path.slice(0),
          offset: startOffset
        }
        annoData.end = {
          offset: endOffset
        }
        // merge annos into parent stack frame
        let parentFrame = last(state.stack)
        parentFrame.annos = parentFrame.annos.concat(stackFrame.annos, annoData)
      } else {
        console.warn('Unknown element type. Taking plain text.', el.outerHTML)
        text = this._prepareText(el.textContent)
        context.text = context.text.concat(text)
        context.offset += text.length
      }
    }
    // return the plain text collected during this reentrant call
    return context.text
  }

  _getConverterForElement(el, mode) {
    var converters
    if (mode === "block") {
      if (!el.tagName) return null
      converters = this._blockConverters
    } else if (mode === "inline") {
      converters = this._propertyAnnotationConverters
    } else {
      converters = this._allConverters
    }
    // the first matching converter wins
    var converter = null
    for (var i = 0; i < converters.length; i++) {
      if (this._converterCanBeApplied(converters[i], el)) {
        converter = converters[i]
        break
      }
    }
    return converter
  }

  _converterCanBeApplied(converter, el) {
    return converter.matchElement(el, this)
  }

  /*
    Wraps the remaining (inline) elements of a node iterator into a default
    block node.

    @param {DOMImporter.ChildIterator} childIterator
    @returns {object} node data
   */
  _wrapInlineElementsIntoBlockElement(childIterator) {
    if (!childIterator.hasNext()) return
    let dom = childIterator.peek().getOwnerDocument()
    let wrapper = dom.createElement('wrapper')
    while(childIterator.hasNext()) {
      const el = childIterator.next()
      // if there is a block node we finish this wrapper
      const blockTypeConverter = this._getConverterForElement(el, 'block')
      if (blockTypeConverter) {
        childIterator.back()
        break
      }
      wrapper.append(el.clone())
    }
    const type = this.schema.getDefaultTextType()
    const id = this._getNextId(dom, type)
    const converter = this._defaultBlockConverter
    let nodeData = { type, id }
    this.state.pushContext('wrapper', converter)
    nodeData = converter.import(wrapper, nodeData, this) || nodeData
    let context = this.state.popContext()
    let annos = context.annos
    // create the node
    const node = this._createNode(nodeData)
    // and all annos which have been created during this call
    annos.forEach((a) => {
      this._createNode(a)
    })
    return node
  }

  // TODO: this needs to be tested and documented
  _prepareText(text) {
    const state = this.state
    if (state.preserveWhitespace) {
      return text
    }
    var repl = SPACE
    // strip runs of tabs and new-lines (they are removed, not replaced by a space)
    text = text.replace(TABS_OR_NL, '')
    // TODO: the last char handling is only necessary for nested calls
    // i.e., when processing the content of an annotation, for instance
    // we need to work out how we could control this with an inner state
    if (state.lastChar === SPACE) {
      text = text.replace(WS_LEFT_ALL, repl)
    } else {
      text = text.replace(WS_LEFT, repl)
    }
    text = text.replace(WS_RIGHT, repl)
    // EXPERIMENTAL: also collapse white-space within the text
    // this happens if somebody treats the text more like it would be done in Markdown
    // i.e. introducing line-breaks
    if (this.config.REMOVE_INNER_WS || state.removeInnerWhitespace) {
      text = text.replace(WS_ALL, SPACE)
    }
    state.lastChar = text[text.length-1] || state.lastChar
    return text
  }

  /*
    Removes any leading and trailing whitespace from the content
    within the given element.
    Attention: this is not yet implemented fully. Atm, trimming is only done
    on the first and last text node (if they exist).
   */
  _trimTextContent(el) {
    var nodes = el.getChildNodes()
    var firstNode = nodes[0]
    var lastNode = last(nodes)
    var text, trimmed
    // trim the first and last text
    if (firstNode && firstNode.isTextNode()) {
      text = firstNode.textContent
      trimmed = this._trimLeft(text)
      firstNode.textContent = trimmed
    }
    if (lastNode && lastNode.isTextNode()) {
      text = lastNode.textContent
      trimmed = this._trimRight(text)
      lastNode.textContent = trimmed
    }
    return el
  }

  _trimLeft(text) {
    return text.replace(WS_LEFT, "")
  }

  _trimRight(text) {
    return text.replace(WS_RIGHT, "")
  }

}

class DOMImporterState {

  constructor() {
    this.reset()
  }

  reset() {
    this.preserveWhitespace = false
    this.nodes = []
    this.annotations = []
    this.containerId = null
    this.container = []
    this.ids = {}
    // stack for reentrant calls into convertElement()
    this.contexts = []
    // stack for reentrant calls into annotatedText()
    this.stack = []
    this.lastChar = ""
    this.skipTypes = {}
    this.ignoreAnnotations = false
    this.isConverting = false

    // experimental: trying to generate simpler ids during import
    // this.uuid = uuid
    this.uuid = createCountingIdGenerator()
  }

  pushContext(tagName, converter) {
    this.contexts.push({ tagName: tagName, converter: converter, annos: []})
  }

  popContext() {
    return this.contexts.pop()
  }

  getCurrentContext() {
    return last(this.contexts)
  }

}

DOMImporter.State = DOMImporterState

DOMImporter.INVISIBLE_CHARACTER = INVISIBLE_CHARACTER