Add demo content from a website such as Wikipedia
Summary
This sample shows how you can generate a set of demo content from a website such as Wikipedia. This is intended to build content for testing, especially for systems such as Viva Topics. The content may not render perfectly but build out a load of documents in the environment.
Implementation
Install Pandoc using the instructions on their site.
Open Windows Powershell ISE
Navigate to the script folder
Run the command and have some patience (it takes a while for a lot of docs) - examples are in the script below
<#
.SYNOPSIS
The script used Pandoc to generate Word files from a set of html files.
.DESCRIPTION
Have you ever needed to have a large amount of documents but didn't just want to duplicate the same thing?
This is especially useful for generating content to try with Viva Topics.
The script will crawl through a website defined in the properties, extract the HTML, save that as a Word document using Pandoc and upload that to a defined SharePoint library.
Properties allow you to define how many files to generate and how many layers to go.
Pre-requisites:
- Pandoc must be installed - https://pandoc.org/installing.html
.EXAMPLE
.\GenerateDocs.ps1 -TargetUrl https://contoso.sharepoint.com/sites/DemoContent-SharePoint -WebUrl "https://www.mcd79.com"
.EXAMPLE
.\GenerateDocs.ps1 -TargetUrl https://contoso.sharepoint.com/sites/DemoContent-SharePoint -TargetLibrary "Content" -WebUrl "https://www.mcd79.com"
.EXAMPLE
.\GenerateDocs.ps1 -TargetUrl https://contoso.sharepoint.com/sites/DemoContent-SharePoint -WebUrl "https://www.wikipedia.org" -WebExtension "wiki/SharePoint"
.EXAMPLE
.\GenerateDocs.ps1 -TargetUrl https://contoso.sharepoint.com/sites/DemoContent-SharePoint -WebUrl "https://www.wikipedia.org" -WebExtension "wiki/SharePoint" -maxLinks 5000 -maxLevels 5
.PARAMETER WebUrl
The base URL of the webpage to be inventoried e.g. https://www.wikipedia.org
.PARAMETER WebExtension
An optional parameter to define a specific page to start from e.g. wiki/SharePoint
.PARAMETER TargetUrl
The URL of the SharePoint Online site where the content should be loaded.
.PARAMETER TargetLibrary
The URL of the SharePoint Online site where the content should be loaded.
.PARAMETER MaxLinks
The maximum number of links to navigate through
.PARAMETER MaxLevels
The maximum number of levels to traverse
#>
[CmdletBinding()]
param(
[parameter(Mandatory = $true)][string]$WebUrl,
[parameter(Mandatory = $false)][string]$WebExtension = "",
[parameter(Mandatory = $true)][string]$TargetUrl,
[parameter(Mandatory = $false)][string]$TargetLibrary = "Shared Documents",
[parameter(Mandatory = $false)][int]$maxLinks = 200,
[parameter(Mandatory = $false)][int]$maxLevels = 5
)
$Script:maxLinks = $maxLinks
$Script:maxLevels = $maxLevels
$Script:numberLinks = 0
$Script:linksVisited = @()
Function CrawlLink($site, $level) {
Try {
Write-Output "Crawling $site"
$request = Invoke-WebRequest $site
$content = $request.Content
$pattern = '[\\/]'
$htmlName = $site -replace 'http://', ''
$htmlName = $htmlName -replace 'https://', ''
$htmlName = $htmlName -replace $pattern, '-'
$outputFile = "./outputs/$htmlName.html"
$outputDoc = "./outputs/$htmlName.docx"
$content | Out-File -Path $outputFile
$exe = "pandoc.exe"
&$exe $outputFile -o $outputDoc -f html -t docx
Add-PnPFile -Path $outputDoc -Folder $targetLibrary
Remove-Item -Path $outputFile
Remove-Item -Path $outputDoc
#pandoc -s -S $outputFile -o $outputDoc
$domain = ($site.Replace("http://", "").Replace("https://", "")).Split('/')[0]
$start = 0
$end = 0
$start = $content.IndexOf("<a ", $end)
while ($start -ge 0) {
if ($start -ge 0) {
#Write-Output $start
# Get the position of of the beginning of the link. The +6 is to go past the href="
$start = $content.IndexOf("href=", $start) + 6
if ($start -ge 6) {
$end = $content.IndexOf("""", $start)
$end2 = $content.IndexOf("'", $start)
if ($end2 -lt $end -and $end2 -ne -1) {
$end = $end2
}
if ($end -ge $start) {
$link = $content.Substring($start, $end – $start)
# Handle case where link is relative
if ($link.StartsWith("/")) {
$link = $site.Split('/')[0] + "//" + $domain + $link
}
if ($Script:numberLinks -le $Script:maxLinks -and $level -le $Script:maxLevels) {
if (($Script:linksVisited -notcontains $link) -and $link.StartsWith("https:")) {
$Script:numberLinks++
$newDomain = ($link.Replace("http://", "").Replace("https://", "")).Split('/')[0]
Write-Output "$newDomain - $WebUrl"
if ($newDomain -eq $WebUrl.Replace("http://", "").Replace("https://", "")) {
#Write-Output $Script:numberLinks"["$level"] – "$link -BackgroundColor Blue -ForegroundColor White
$Script:linksVisited += $link
CrawlLink $link ([int]($level + 1))
}
}
}
}
}
}
$start = $content.IndexOf("<a ", $end)
}
}
Catch [system.exception] {
Write-Output "ERROR: $_"
}
}
Connect-PnPOnline $TargetUrl -Interactive
CrawlLink "$WebUrl/$extension" 0
Check out the PnP PowerShell to learn more at: https://aka.ms/pnp/powershell
The way you login into PnP PowerShell has changed please read PnP Management Shell EntraID app is deleted : what should I do ?
Contributors
Author(s) |
---|
Kevin McDonnell |
Disclaimer
THESE SAMPLES ARE PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR NON-INFRINGEMENT.